Purpose
This report documents the tree-model optimization stage of the fraud detection system.
By this point, the linear baseline had already been tuned and gave a useful, interpretable reference. It also showed a practical ceiling: fraud patterns in this dataset depend on interactions and local combinations, so performance starts to plateau when a model relies on one global decision surface.
Tree-based models were introduced to handle that structure more naturally. They can split the data into local regions and capture combinations such as:
- short policy tenure + severe incident
- missing documentation + unusual claim composition
- category combinations that become risky only in certain contexts
A simple way to picture it:
- Linear model → one long ruler across the page
- Tree model → branching checklist with follow-up questions
This stage also compares two dataset representations because model families “see” data differently:
- Original dataset (50 features) — broader representation
- Trees dataset (42 features) — tree-optimized representation
The central question is straightforward: does the tree-optimized representation (42 features) actually let tree models perform better than the broader original feature set (50 features)?
1. Why Trees Were Trained Separately from Linear Models
Different model families need different kinds of feature representation. This is a design choice, not a cosmetic one.
How linear models read data
Linear models work best when the signal has already been translated into:
- binary indicators
- buckets
- explicit threshold flags
- one-hot encoded categories
That is why the earlier linear pipeline used more manual shaping.
How tree models read data
Tree models can discover thresholds and interactions on their own. They benefit from richer structure and usually need less “flattening.” In some cases, too much bucketing removes useful detail.
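As a sketch of that idea: a shallow tree recovers a numeric cutoff from raw values without any manual bucketing. The feature name and the 30-day cutoff below are invented for illustration:

```python
# Minimal sketch: a depth-1 tree finds a numeric threshold on its own.
# The "tenure" feature and the 30-day risk cutoff are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
tenure_days = rng.integers(1, 365, size=400).reshape(-1, 1)
# Synthetic label: short tenure (< 30 days) is flagged as risky
y = (tenure_days.ravel() < 30).astype(int)

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(tenure_days, y)
learned_threshold = tree.tree_.threshold[0]  # lands near the true cutoff of 30
```

No bucketed or one-hot version of the feature was needed; the split point emerges from the raw values.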
That is why this project keeps a separate tree pathway and compares two representations directly.
Dataset setup used in this stage
| Dataset Version | Train Shape | Test Shape | Notes |
|---|---|---|---|
| Original | (800, 50) | (200, 50) | Broader feature set |
| Trees | (800, 42) | (200, 42) | Tree-optimized representation |
Both splits were stratified, and the training data preserved the same fraud imbalance pattern (about 3.04 non-fraud per 1 fraud), so the comparison stayed fair.
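A minimal sketch of that split setup, using synthetic labels at the report's roughly 3.04 : 1 imbalance:

```python
# Sketch of the stratified 800/200 split; labels are synthetic but match
# the report's imbalance (602 non-fraud : 198 fraud ≈ 3.04 : 1).
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 198 + [0] * 602)
X = np.zeros((len(y), 1))  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# stratify=y keeps the fraud rate nearly identical in both splits
```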
2. Optimization Strategy
All candidate tree-family models were tuned using the same process, so the comparison reflects model behavior rather than inconsistent setup.
Candidate models screened and tuned
| Model Family | Why Included |
|---|---|
| RandomForest | Strong tabular baseline, captures interactions |
| XGBoost | Powerful boosting model, often excellent on structured data |
| ExtraTrees | High-randomness tree ensemble, often robust on noisy tabular data |
| AdaBoost | Sequential boosting with simpler learners |
| Bagging (DecisionTree base) | Ensemble stability benchmark |
Training and tuning setup
| Component | Configuration |
|---|---|
| Cross-validation | 5-fold StratifiedKFold |
| Search method | RandomizedSearchCV |
| Refit metric | F2 |
| Additional tracking | PR-AUC |
| Objective priority | Recall-weighted fraud triage performance |
Why F2?
F2 gives more weight to recall than precision. In fraud triage:
- missing a fraud case (false negative) usually costs more
- reviewing one extra legitimate claim (false positive) is still costly, but often less severe
That makes F2 a practical optimization target for this stage.
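A quick illustration with invented predictions: under beta=2, a model that catches every fraud at the cost of two false alarms outscores one that misses frauds to stay precise:

```python
# Toy comparison: F2 rewards recall more than precision.
# Labels and predictions are invented for illustration.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
high_recall = [1, 1, 1, 1, 1, 1, 0, 0]    # catches all frauds, 2 false alarms
high_precision = [1, 1, 0, 0, 0, 0, 0, 0]  # no false alarms, misses 2 frauds

f2_recall = fbeta_score(y_true, high_recall, beta=2)      # ≈ 0.909
f2_precision = fbeta_score(y_true, high_precision, beta=2)  # ≈ 0.556
```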
// Code Snippet — Tree Tuning Pattern (Conceptual)

```python
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# F2 scorer: recall-weighted, matching the triage objective
f2_scorer = make_scorer(fbeta_score, beta=2)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# `candidate_tree_model` and `param_grid` are placeholders: one per
# tree-family candidate and its search space. Note that inside a
# Pipeline, search-space keys need the "model__" prefix
# (e.g. "model__max_depth").
pipe = Pipeline([
    ("scaler", RobustScaler()),  # trees don't need scaling, but keeping it
                                 # here keeps preprocessing inside CV folds
    ("model", candidate_tree_model),
])

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    n_iter=30,
    scoring={"f2": f2_scorer, "pr_auc": "average_precision"},
    refit="f2",
    cv=cv,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```
Why this matters: The model is tuned using the metric that matches the business problem, and the pipeline keeps preprocessing inside CV folds so evaluation stays clean.
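As a self-contained miniature of the same pattern (synthetic data, a single RandomForest, a tiny hypothetical search space), showing how both tracked metrics can be read back after the refit on F2:

```python
# Miniature, runnable version of the tuning pattern above.
# Data, model choice, and search space are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

X, y = make_classification(n_samples=400, weights=[0.75], random_state=42)
f2 = make_scorer(fbeta_score, beta=2)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"max_depth": [3, 5, None]},
    n_iter=3,
    scoring={"f2": f2, "pr_auc": "average_precision"},
    refit="f2",  # the best estimator is chosen on F2
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

# Both tracked metrics are available for the refit candidate
i = search.best_index_
cv_f2 = search.cv_results_["mean_test_f2"][i]
cv_pr_auc = search.cv_results_["mean_test_pr_auc"][i]
```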
3. Cross-Validation Results
Cross-validation is the first real stress test. It checks whether a model performs consistently before the held-out test set gets involved.
What this means
The tree-optimized representation improves CV F2 across every model family. These are not tiny gains. The feature representation is changing how well trees can split the risk space.
That’s the useful lesson here: feature engineering changes the “terrain” the model navigates.
4. Best Hyperparameter Snapshot (Trees Dataset)
Hyperparameters tell you how each model prefers to behave on this fraud problem. The interpretation below summarizes the best-performing settings from CV on the Trees dataset.
Plain-English interpretation
- Class weighting / positive-class weighting appears repeatedly. That’s the models being told: “Fraud is rarer, pay attention.”
- Depth limits + min_samples_leaf show up in strong models. That helps prevent memorizing noise.
- XGBoost prefers a very low learning rate here, which means slower, more careful fitting.
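Hypothetical search spaces that illustrate those preferences; the values below are not the actual tuned settings:

```python
# Illustrative search spaces only — actual tuned values are in the
# snapshot above, not reproduced here.
from scipy.stats import loguniform

rf_space = {
    "class_weight": ["balanced", {0: 1, 1: 3}],  # "fraud is rarer, pay attention"
    "max_depth": [4, 6, 8, None],                # depth limits
    "min_samples_leaf": [2, 5, 10],              # guards against memorizing noise
}

xgb_space = {
    "scale_pos_weight": [1.0, 3.0],              # positive-class weighting
    "learning_rate": loguniform(1e-3, 1e-1),     # low rates = slower, careful fitting
    "max_depth": [3, 4, 5],
}
```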
5. Test-Set Results (Threshold = 0.5)
After tuning, the best models were evaluated on the held-out test set using the standard threshold of 0.5 for a clean comparison.
What stands out
- XGBoost had the strongest CV F2 during tuning.
- ExtraTrees delivered the strongest test-set F2 at threshold 0.5.
That’s normal. Cross-validation and test evaluation often shuffle the order slightly. CV is the rehearsal; test is opening night.
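For reference, the threshold-0.5 evaluation reduces to a simple probability cut; the probabilities and labels below are invented:

```python
# Sketch of threshold-0.5 scoring: cut predicted probabilities at 0.5,
# then compute F2 and recall. All numbers are illustrative.
import numpy as np
from sklearn.metrics import fbeta_score, recall_score

y_test = np.array([0, 0, 0, 1, 1, 0, 1, 0])
proba = np.array([0.1, 0.4, 0.6, 0.8, 0.55, 0.2, 0.3, 0.05])

y_pred = (proba >= 0.5).astype(int)
f2_at_05 = fbeta_score(y_test, y_pred, beta=2)
recall_at_05 = recall_score(y_test, y_pred)
```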
6. Output Artifacts for Next Stages
The tuned tree models were saved as uncalibrated artifacts for later stages. That separation is intentional and useful:
| Stage | Question It Answers |
|---|---|
| Training / Tuning | Which model family and settings perform best? |
| Threshold Design | How many claims get flagged, and how many frauds are caught? |
| Calibration (next) | Do the predicted probabilities behave honestly? |
| Interpretability (later) | Which features drove the score? |
Keeping these stages separate makes the system easier to audit, explain, and maintain.
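One plausible way to persist such an uncalibrated artifact is with joblib; the model, data, and filename below are illustrative:

```python
# Sketch of saving a tuned, uncalibrated model for the thresholding
# and calibration stages. Filename and training data are illustrative.
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=100, random_state=42)
best_model = ExtraTreesClassifier(n_estimators=50, random_state=42).fit(X, y)

joblib.dump(best_model, "extratrees_uncalibrated.joblib")
reloaded = joblib.load("extratrees_uncalibrated.joblib")
```

Saving the raw estimator (rather than a calibrated wrapper) keeps each downstream stage's question separable, as in the table above.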
Key Takeaway
Tree optimization is the point where the fraud pipeline becomes operationally strong.
Using separate pathways for linear and tree models was the right design decision because the model families learn differently. The tree-optimized feature representation improved cross-validated F2 across all tree families and produced stronger test performance overall.
At the default threshold, ExtraTrees delivered the strongest held-out F2 with high recall. This uncalibrated ranking artifact will be passed directly into the thresholding and calibration stages.
In short: the modeling got stronger, and the candidate models are ready for operational evaluation.