Purpose

This report documents the tree-model optimization stage of the fraud detection system.

By this point, the linear baseline had already been tuned and gave a useful, interpretable reference. It also showed a practical ceiling: fraud patterns in this dataset depend on interactions and local combinations, so performance starts to plateau when a model relies on one global decision surface.

Tree-based models were introduced to handle that structure more naturally. They can split the data into local regions and capture combinations such as:

  • short policy tenure + severe incident
  • missing documentation + unusual claim composition
  • category combinations that become risky only in certain contexts

A simple way to picture it:

  • Linear model → one long ruler across the page
  • Tree model → branching checklist with follow-up questions
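The contrast can be sketched with hypothetical tenure/severity features (not the project's real columns): a shallow tree isolates the "short tenure + severe incident" corner, while a single linear surface has to compromise across the whole plane.

```python
# Hypothetical features: fraud occurs only when short tenure AND severe incident.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
tenure = rng.uniform(0, 10, 1000)    # policy tenure, years
severity = rng.uniform(0, 5, 1000)   # incident severity score
X = np.column_stack([tenure, severity])
y = ((tenure < 2) & (severity > 3)).astype(int)  # fraud lives in one corner

# A depth-2 "branching checklist" can carve out the corner...
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
# ...while one global linear boundary cannot isolate it.
logit = LogisticRegression().fit(X, y)
print(tree.score(X, y), logit.score(X, y))
```

In this sketch the tree recovers the interaction almost perfectly; the linear model can only approximate the corner with one straight boundary.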

This stage also compares two dataset representations because model families “see” data differently:

  • Original dataset (50 features) — broader representation
  • Trees dataset (42 features) — tree-optimized representation

The central question is straightforward:

“Which tree model and feature representation produce the strongest fraud recall under operational constraints?”
. . .

1. Why Trees Were Trained Separately from Linear Models

Different model families need different kinds of feature representation. This is a design choice, not a cosmetic one.

How linear models read data

Linear models work best when the signal has already been translated into:

  • binary indicators
  • buckets
  • explicit threshold flags
  • one-hot encoded categories

That is why the earlier linear pipeline used more manual shaping.
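As a sketch of that manual shaping (column names are illustrative, not the project's actual schema), buckets and one-hot columns can be produced in a single transformer:

```python
# Illustrative shaping for a linear model: bucket a numeric column and
# one-hot a categorical one so each range/level gets its own coefficient.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

claims = pd.DataFrame({
    "months_as_customer": [3, 48, 120, 9],
    "incident_severity": ["Minor", "Major", "Total Loss", "Major"],
})

shaping = ColumnTransformer([
    # coarse tenure buckets a linear model can weight independently
    ("tenure_buckets",
     KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="quantile"),
     ["months_as_customer"]),
    # one column per severity level
    ("severity_ohe",
     OneHotEncoder(handle_unknown="ignore"),
     ["incident_severity"]),
])

Xt = shaping.fit_transform(claims)
print(Xt.shape)  # 3 tenure buckets + 3 severity levels = 6 columns
```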

How tree models read data

Tree models can discover thresholds and interactions on their own. They benefit from richer structure and usually need less “flattening.” In some cases, too much bucketing removes useful detail.

That is why this project keeps a separate tree pathway and compares two representations directly.

Dataset setup used in this stage

  Dataset Version   Train Shape   Test Shape   Notes
  Original          (800, 50)     (200, 50)    Broader feature set
  Trees             (800, 42)     (200, 42)    Tree-optimized representation

Both splits were stratified, and the training data preserved the same fraud imbalance pattern (about 3.04 non-fraud per 1 fraud), so the comparison stayed fair.
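The stratified split described above can be sketched as follows (synthetic stand-in data; the real datasets carry 50 and 42 features):

```python
# Stratified split: the fraud rate is preserved in both train and test.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = (rng.uniform(size=1000) < 0.25).astype(int)  # roughly 3:1 non-fraud to fraud

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Stratification keeps the fraud rate nearly identical across the splits.
print(y_tr.mean(), y_te.mean())
```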

. . .

2. Optimization Strategy

All candidate tree-family models were tuned using the same process, so the comparison reflects model behavior rather than inconsistent setup.

Candidate models screened and tuned

  Model Family                  Why Included
  RandomForest                  Strong tabular baseline, captures interactions
  XGBoost                       Powerful boosting model, often excellent on structured data
  ExtraTrees                    High-randomness tree ensemble, often robust on noisy tabular data
  AdaBoost                      Sequential boosting with simpler learners
  Bagging (DecisionTree base)   Ensemble stability benchmark

Training and tuning setup

  Component             Configuration
  Cross-validation      5-fold StratifiedKFold
  Search method         RandomizedSearchCV
  Refit metric          F2
  Additional tracking   PR-AUC
  Objective priority    Recall-weighted fraud triage performance

Why F2?

F2 gives more weight to recall than precision. In fraud triage:

  • missing a fraud case (false negative) usually costs more
  • reviewing one extra legitimate claim (false positive) is still costly, but often less severe

That makes F2 a practical optimization target for this stage.
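A small worked example makes the weighting concrete: with recall at 0.5 and precision at about 0.67, F2 comes out below F1, because F2 punishes the missed frauds harder.

```python
# Numeric illustration of F2 vs F1 on a toy confusion pattern.
from sklearn.metrics import fbeta_score

# 4 true frauds: this classifier misses 2 (low recall) with 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

f1 = fbeta_score(y_true, y_pred, beta=1)  # precision and recall weighted equally
f2 = fbeta_score(y_true, y_pred, beta=2)  # recall weighted 4x precision
print(f1, f2)  # F2 < F1 here because recall (0.50) lags precision (0.67)
```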

# Code Snippet — Tree Tuning Pattern (Conceptual)
# candidate_tree_model, param_grid, X_train, and y_train are placeholders
# supplied elsewhere in the pipeline.
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# F2 weights recall above precision (beta=2)
f2_scorer = make_scorer(fbeta_score, beta=2)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipe = Pipeline([
    # Trees are scale-invariant; the scaler is kept for pipeline consistency
    ("scaler", RobustScaler()),
    ("model", candidate_tree_model)
])

search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    n_iter=30,
    scoring={"f2": f2_scorer, "pr_auc": "average_precision"},
    refit="f2",
    cv=cv,
    random_state=42,
    n_jobs=-1
)

search.fit(X_train, y_train)
best_model = search.best_estimator_

Why this matters: The model is tuned using the metric that matches the business problem, and the pipeline keeps preprocessing inside CV folds so evaluation stays clean.

. . .

3. Cross-Validation Results

Cross-validation is the first real stress test. It checks whether a model performs consistently before the held-out test set gets involved.

“Tree-optimized feature representation improves cross-validated F2 for every tree-family model.”

What this means

The tree-optimized representation improves CV F2 across every model family, and the gains are consistent rather than marginal. The feature representation is changing how well trees can split the risk space.

That’s the useful lesson here: feature engineering changes the “terrain” the model navigates.
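The comparison pattern looks like this sketch (synthetic stand-ins for the two representations; the real comparison used the tuned pipelines):

```python
# Same model family, same F2 scorer, two feature representations.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

f2 = make_scorer(fbeta_score, beta=2)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

X_orig, y = make_classification(
    n_samples=800, n_features=50, weights=[0.75], random_state=0
)
X_trees = X_orig[:, :42]  # stand-in for the 42-feature tree-optimized set

for name, X in [("original", X_orig), ("trees", X_trees)]:
    scores = cross_val_score(
        RandomForestClassifier(random_state=42), X, y, scoring=f2, cv=cv
    )
    print(f"{name}: mean CV F2 = {scores.mean():.3f}")
```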

. . .

4. Best Hyperparameter Snapshot (Trees Dataset)

Hyperparameters tell you how each model prefers to behave on this fraud problem. The values below are the best-performing settings from CV on the Trees dataset.

“Top-performing tree models converged on fraud-aware settings: class weighting, controlled depth, and regularization.”

Plain-English interpretation

  • Class weighting / positive-class weighting appears repeatedly. That’s the models being told: “Fraud is rarer, pay attention.”
  • Depth limits + min_samples_leaf show up in strong models. That helps prevent memorizing noise.
  • XGBoost prefers a very low learning rate here, which means slower, more careful fitting.
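Search spaces that encode those fraud-aware preferences might look like this sketch (ranges are illustrative, not the project's exact values):

```python
# Illustrative param_distributions for RandomizedSearchCV, using the
# "model__" prefix because the estimator sits inside a Pipeline.
from scipy.stats import randint, uniform

rf_params = {
    "model__class_weight": ["balanced", "balanced_subsample"],  # "fraud is rarer, pay attention"
    "model__max_depth": randint(4, 16),                         # controlled depth
    "model__min_samples_leaf": randint(2, 20),                  # avoid memorizing noise
    "model__n_estimators": randint(200, 600),
}

xgb_params = {
    "model__scale_pos_weight": uniform(2.0, 4.0),  # positive-class weighting
    "model__learning_rate": uniform(0.01, 0.09),   # low rate: slower, more careful fitting
    "model__max_depth": randint(3, 8),
    "model__reg_lambda": uniform(0.5, 4.5),        # L2 regularization
}
```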
. . .

5. Test-Set Results (Threshold = 0.5)

After tuning, the best models were evaluated on the held-out test set using the standard threshold of 0.5 for a clean comparison.

“ExtraTrees achieved the strongest default-threshold F2 on the held-out test set, with high recall for fraud triage.”

What stands out

  • XGBoost had the strongest CV F2 during tuning.
  • ExtraTrees delivered the strongest test-set F2 at threshold 0.5.

That’s normal. Cross-validation and test evaluation often shuffle the order slightly. CV is the rehearsal; test is opening night.
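Default-threshold evaluation can be written as an explicit step, as in this sketch (model and data are synthetic stand-ins; the real run used the tuned artifacts and the held-out test set):

```python
# Score fraud probabilities, then apply an explicit decision threshold.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

def evaluate_at_threshold(model, X_test, y_test, threshold=0.5):
    """Convert fraud probabilities to flags at a threshold, then score F2."""
    proba = model.predict_proba(X_test)[:, 1]
    y_pred = (proba >= threshold).astype(int)
    return fbeta_score(y_test, y_pred, beta=2)

# Demo on synthetic stand-in data
X, y = make_classification(n_samples=1000, weights=[0.75], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model = ExtraTreesClassifier(random_state=42).fit(X_tr, y_tr)
print(evaluate_at_threshold(model, X_te, y_te, threshold=0.5))
```

Making the threshold an explicit argument keeps the 0.5 comparison honest and sets up the later thresholding stage, where that value becomes a tunable operating point.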

. . .

6. Output Artifacts for Next Stages

The tuned tree models were saved as uncalibrated artifacts for later stages. That separation is intentional and useful:

  Stage                      Question It Answers
  Training / Tuning          Which model family and settings perform best?
  Threshold Design           How many claims get flagged, and how many frauds are caught?
  Calibration (next)         Do the predicted probabilities behave honestly?
  Interpretability (later)   Which features drove the score?

Keeping these stages separate makes the system easier to audit, explain, and maintain.
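Persisting the uncalibrated artifact is straightforward with joblib, as in this sketch (path and model are illustrative stand-ins for the tuned estimator):

```python
# Save the tuned model as an uncalibrated artifact; calibration happens later.
import os
import tempfile

import joblib
from sklearn.ensemble import ExtraTreesClassifier

best_model = ExtraTreesClassifier(random_state=42)  # stand-in for the tuned estimator
path = os.path.join(tempfile.mkdtemp(), "extratrees_uncalibrated.joblib")
joblib.dump(best_model, path)

# Later stages (thresholding, calibration) reload the exact same artifact.
loaded = joblib.load(path)
print(type(loaded).__name__)
```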

[Figure: minimalist abstract sequence showing six geometric stages of a machine learning model pipeline]
“Tree optimization selects candidate models. Calibration and interpretability are handled as separate audit steps.”
. . .

Key Takeaway

Tree optimization is the point where the fraud pipeline becomes operationally strong.

Using separate pathways for linear and tree models was the right design decision because the model families learn differently. The tree-optimized feature representation improved cross-validated F2 across all tree families and produced stronger test performance overall.

At the default threshold, ExtraTrees delivered the strongest held-out F2 with high recall. This uncalibrated ranking artifact will be passed directly into the thresholding and calibration stages.

In short: the modeling got stronger, and the candidate models are ready for operational evaluation.