Purpose

This report documents the probability calibration stage for the tuned tree models.

By this point, the models were already good at separating suspicious claims from normal ones. The remaining issue was more subtle: the score values themselves (0.2, 0.6, 0.9, etc.) are often not “honest probabilities” out of the box for tree ensembles.

In fraud work, that matters a lot.

A score is used for more than a yes/no decision. It is also used for:

  • ranking which claims analysts review first
  • setting triage cutoffs by team capacity
  • comparing risk across batches
  • generating explainable risk narratives

If the score scale is distorted, the queue becomes messy even when the classifier still looks good on recall.

This stage calibrates the tree models so the probability scale becomes more reliable for decision support. The project also keeps both versions (uncalibrated and calibrated), because they serve different operational purposes.

The central question:

“How do we turn model scores into risk estimates that are usable in real fraud triage workflows?”
. . .

1. Why Calibration Exists at All

Tree models (RandomForest, XGBoost, ExtraTrees, AdaBoost, Bagging) often produce strong ranking performance, but their raw probability outputs can be “overconfident” or “underconfident.”

Simple version:

  • the model may rank fraud cases correctly,
  • but the number 0.80 does not always mean “about 80% fraud likelihood.”

That is exactly the problem calibration solves: it reshapes the score scale so probability values behave more consistently for ranking and triage decisions. This calibration step was explicitly introduced for that reason in your pipeline.
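One way to see whether raw scores are "honest" is a reliability check: bin the predicted probabilities and compare each bin's mean prediction against the observed fraud rate. A minimal sketch on synthetic data (the dataset, model choice, and variable names here are illustrative, not taken from the pipeline):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the claims data (illustration only)
X, y = make_classification(n_samples=4000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
probs = rf.predict_proba(X_te)[:, 1]

# A well-calibrated model has predicted ~= observed fraud rate in every bin
prob_true, prob_pred = calibration_curve(y_te, probs, n_bins=10)
for p_hat, p_obs in zip(prob_pred, prob_true):
    print(f"predicted ~{p_hat:.2f}  observed {p_obs:.2f}")
```

Where the two columns diverge, the score scale is distorted even if the ranking is fine.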

Plain-language metaphor

Imagine a bathroom scale that always puts heavier people above lighter people, so the ordering is fine — but it adds +7 kg to everyone. It still sorts people correctly, but the numbers are wrong.

Calibration fixes the numbers, not just the order.

That distinction is gold in fraud systems.

. . .

2. Why Calibration Matters Specifically for Fraud

Fraud detection is rarely “predict and forget.” It behaves like a triage queue:

  • Claims are sorted by risk score
  • A review team has limited capacity
  • A threshold is set depending on workload and tolerance for misses

That means two separate needs exist at the same time:

A) Capture-oriented mode (recall-heavy)

Useful when you want aggressive fraud capture and are willing to review more claims.

B) Probability-oriented mode (risk scoring / ranking)

Useful when the team depends on the score itself for prioritization, routing, or consistent downstream logic.

Your project explicitly keeps both uncalibrated and calibrated models because those two modes are real, not theoretical. Uncalibrated models were kept for aggressive threshold tuning, while calibrated versions were kept for trustworthy risk scores and probability-based decisions.

That is a very sane system design choice. Many people mash these together and wonder why operations get weird later.
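The capacity-driven cutoff from the triage list above can be sketched in a few lines. `capacity_threshold` is a hypothetical helper for illustration, not part of the pipeline:

```python
import numpy as np

def capacity_threshold(scores, capacity):
    """Return the score cutoff that fills exactly `capacity` review slots."""
    ranked = np.sort(scores)[::-1]   # highest risk first
    k = min(capacity, len(ranked))
    return ranked[k - 1]             # lowest score that still gets reviewed

scores = np.array([0.91, 0.40, 0.77, 0.12, 0.66, 0.85])
cutoff = capacity_threshold(scores, capacity=3)
flagged = scores >= cutoff
print(cutoff)    # 0.77 -> top 3 claims go to review
```

Note this cutoff depends only on score order, which is why the capture-oriented mode can run on uncalibrated scores.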

. . .

3. Calibration Methods Used in This Project

Your pipeline used CalibratedClassifierCV for tree model calibration, with two calibration strategies depending on the model family:

  • XGBoost → Platt scaling (sigmoid)
  • Other tree models → Isotonic regression

What these methods do (simple explanation)

Platt scaling (sigmoid)
Platt scaling fits a smooth S-shaped curve on top of the model’s raw scores. It is a good choice when:

  • score distortion is smooth and monotonic
  • you want a stable, compact mapping
  • you want less risk of overfitting calibration on smaller data

In your pipeline, XGBoost used this approach.
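Conceptually, `method="sigmoid"` fits a one-dimensional logistic curve, p = 1 / (1 + exp(-(a·s + b))), on held-out scores. A toy sketch of that idea (scikit-learn's internal Platt implementation differs in details, e.g. regularized targets; the data below is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw model scores on a held-out fold and their true labels (toy values)
raw_scores = np.array([0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95]).reshape(-1, 1)
labels     = np.array([0,    0,   0,    1,   0,    1,   1])

# Platt scaling ~= one-dimensional logistic regression on the raw scores:
# p_calibrated = 1 / (1 + exp(-(a * score + b)))
platt = LogisticRegression().fit(raw_scores, labels)
calibrated = platt.predict_proba(raw_scores)[:, 1]
```

Because the mapping is smooth and monotonic, the ordering of claims is preserved; only the probability values move.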

Isotonic regression
Isotonic regression fits a monotonic step-like mapping from raw scores to calibrated probabilities. It is more flexible than Platt scaling, because it does not force an S-shape. It is useful when:

  • calibration error is not well-described by one smooth curve
  • you want a more data-driven mapping
  • the score behavior is irregular (common with tree ensembles)

Your other tree models used isotonic regression.
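A minimal sketch of the isotonic idea on toy scores (CalibratedClassifierCV wraps this internally; the standalone IsotonicRegression below is just for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Held-out raw scores and true labels (toy values)
raw_scores = np.array([0.1, 0.25, 0.4, 0.55, 0.7, 0.85])
labels     = np.array([0,   0,    1,   0,    1,   1])

# Isotonic regression learns a monotonic, piecewise-constant score mapping
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, labels)
calibrated = iso.predict(raw_scores)
```

The fitted mapping never decreases, so higher raw scores never get lower calibrated probabilities, but unlike Platt it is free to flatten or steepen wherever the data demands.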

Why this split makes sense in your setup

This was a practical choice for the model families you trained:

  • XGBoost often benefits from a smoother probability remapping (Platt/sigmoid)
  • Bagged/ensemble trees can have more jagged score distributions, where isotonic can adapt better

That is consistent with the calibration design in your pipeline. The key point is that calibration was treated as a probability-quality step, not as a retraining trick.

// Code Snippet — Applying Calibration
from sklearn.calibration import CalibratedClassifierCV

# XGBoost -> Platt scaling (sigmoid)
# cv=5: CalibratedClassifierCV refits clones of the base estimator on
# internal folds and fits each calibrator on the corresponding held-out fold
xgb_calibrated = CalibratedClassifierCV(
    xgb_best_model,
    method="sigmoid",
    cv=5
)

# Other tree models -> isotonic regression
rf_calibrated = CalibratedClassifierCV(
    rf_best_model,
    method="isotonic",
    cv=5
)

xgb_calibrated.fit(X_train, y_train)
rf_calibrated.fit(X_train, y_train)
. . .

4. Alternatives You Could Use

For readers who haven’t touched calibration before, here are the common alternatives:

  • Temperature scaling: Often used in deep learning. It adjusts confidence with one scalar “temperature.” Very clean for neural nets, less common as the main choice for mixed tree ensembles.
  • Histogram binning / quantile binning: Groups scores into bins and recalculates empirical fraud rates per bin. Easy to explain, but can become coarse and unstable depending on sample size.
  • Beta calibration: A more flexible parametric calibration method than Platt in some settings. Useful, but less standard in many scikit-style pipelines.
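Of these alternatives, histogram binning is simple enough to sketch directly. Everything below, names included, is illustrative rather than a production recipe:

```python
import numpy as np

def histogram_binning(scores, labels, n_bins=10):
    """Map each score to the empirical fraud rate of its score bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    # Empirical positive rate per bin (0.0 for empty bins)
    rates = np.array([
        labels[bins == b].mean() if np.any(bins == b) else 0.0
        for b in range(n_bins)
    ])
    return rates[bins]

rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = (rng.random(1000) < scores).astype(int)  # roughly calibrated by design
calibrated = histogram_binning(scores, labels, n_bins=5)
```

The coarseness warning above shows up directly here: with few samples per bin, the per-bin rates get noisy, and the mapping is not even guaranteed to be monotonic.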

Why your setup is strong for this project

Using CalibratedClassifierCV with Platt + isotonic is a very practical and defensible choice for classical ML fraud pipelines: widely used, integrates cleanly with sklearn workflows, easy to reproduce, and provides clear separation between model fitting and probability correction.

And crucially, your results show why calibration must be evaluated with context: calibration improves probability quality, but it can move recall at a fixed threshold because the score scale changes. That behavior appears directly in your metrics.

. . .

5. What Calibration Changes in Practice

Calibration does not change the underlying fraud pattern learned by the model.

It changes how score values map to risk.

That means the same threshold (for example, 0.5) can behave differently before and after calibration, because 0.5 now means something different on the recalibrated scale.

Your project notes exactly this effect in the tree models: recall often drops at the default 0.5 threshold after calibration, because the probability scale has been reshaped. This is expected behavior, and it is one reason threshold selection remains an operational choice after calibration.
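One way to re-check thresholds after calibration is to pick, on a validation set, the highest cutoff that still meets a target recall. A sketch with hypothetical names and toy values (`y_val`, `p_val`, and the helper are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, probs, target_recall=0.80):
    """Highest threshold whose recall still meets the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    # recall[:-1] aligns with thresholds (the final point has no threshold)
    ok = recall[:-1] >= target_recall
    return thresholds[ok].max() if ok.any() else thresholds.min()

y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
p_val = np.array([0.10, 0.20, 0.45, 0.30, 0.60, 0.70, 0.35, 0.80, 0.25, 0.55])
t = threshold_for_recall(y_val, p_val, target_recall=0.80)
```

Running this on calibrated probabilities gives a cutoff that restores the desired capture rate on the new scale, instead of trusting that 0.5 still means what it used to.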

Simple metaphor

Think of calibration like changing from Fahrenheit to Celsius.

The weather did not change. The numbers changed. If you keep the same cutoff number after changing the scale, you will make weird decisions.

That is why “calibrate first, then re-check thresholds” is the grown-up move.

“Calibration aligns predicted fraud probabilities with observed fraud frequency, improving the trustworthiness of risk scores used in triage.”
“Calibration reshapes the score distribution. A threshold like 0.5 will select different claims after calibration, even when the model’s learned patterns remain the same.”
. . .

6. Calibration Results (Original Feature Representation)

Below is a clean summary of the calibration comparison at the default threshold (0.5) for the Original dataset representation.

Model        | Version      | Precision | Recall | F1     | PR-AUC | ROC-AUC
RandomForest | Uncalibrated | 0.6207    | 0.7347 | 0.6729 | 0.5356 | 0.7883
RandomForest | Calibrated   | 0.6275    | 0.6531 | 0.6400 | 0.5510 | 0.7940
XGBoost      | Uncalibrated | 0.6087    | 0.5714 | 0.5895 | 0.4545 | 0.7546
XGBoost      | Calibrated   | 0.6078    | 0.6327 | 0.6200 | 0.5447 | 0.8077
ExtraTrees   | Uncalibrated | 0.6207    | 0.7347 | 0.6729 | 0.5180 | 0.7804
ExtraTrees   | Calibrated   | 0.6222    | 0.5714 | 0.5957 | 0.4860 | 0.7725
“Calibration affects threshold metrics and ranking metrics differently across model families; the impact depends on score-shape distortion.”

What stands out here

XGBoost benefited a lot from calibration on the Original dataset: PR-AUC improved noticeably (0.4545 → 0.5447), ROC-AUC improved by roughly five points (0.7546 → 0.8077), and recall improved at threshold 0.5 (0.5714 → 0.6327). That is a strong example of calibration improving score quality and also helping operational behavior at the default threshold.
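The table reports threshold metrics and ranking metrics, but probability quality itself is usually checked with a proper scoring rule such as the Brier score. A toy sketch of that comparison (the numbers below are made up, not from the project; note that a purely monotonic recalibration leaves ROC-AUC unchanged, because the ordering is preserved):

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

# Made-up labels plus raw and recalibrated probabilities for the same claims
y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
raw    = np.array([0.30, 0.95, 0.10, 0.90, 0.85, 0.40, 0.20, 0.70])
calib  = np.array([0.20, 0.80, 0.10, 0.75, 0.70, 0.30, 0.15, 0.60])

for name, p in [("uncalibrated", raw), ("calibrated", calib)]:
    print(name,
          "Brier:", round(brier_score_loss(y_true, p), 4),
          "PR-AUC:", round(average_precision_score(y_true, p), 4),
          "ROC-AUC:", round(roc_auc_score(y_true, p), 4))
```

In the real pipeline, cross-fitted calibration can also shift the ranking metrics themselves (as the XGBoost ROC-AUC gain above shows), because CalibratedClassifierCV averages over refit estimators rather than applying one fixed monotonic map.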

. . .

7. Calibration Results (Tree-Optimized Feature Representation)

You also evaluated calibrated vs uncalibrated models on the Trees dataset representation.

Model          | Version      | Precision | Recall | F1     | PR-AUC | ROC-AUC
XGBoost        | Uncalibrated | 0.6364    | 0.8571 | 0.7304 | 0.5583 | 0.8214
XGBoost        | Calibrated   | 0.6129    | 0.7755 | 0.6858 | 0.5227 | 0.8014
ExtraTrees     | Uncalibrated | 0.6406    | 0.8367 | 0.7257 | 0.5448 | 0.8134
ExtraTrees     | Calibrated   | 0.6034    | 0.7143 | 0.6542 | 0.5137 | 0.8019
VotingEnsemble | Uncalibrated | 0.6269    | 0.8571 | 0.7242 | 0.5538 | 0.8192
VotingEnsemble | Calibrated   | 0.6071    | 0.6939 | 0.6476 | 0.5421 | 0.8154

What stands out here

For the strongest tree models, calibration reduced recall at the default threshold (0.5). Your own summary correctly explains why: calibration changed the probability scale, so 0.5 now cuts the score distribution differently.

Example: Trees / XGBoost recall drops from 0.8571 → 0.7755 after calibration at threshold 0.5.

This does not mean the calibrated model is “worse” in a general sense. It means the threshold now needs to be re-tuned on the new score scale if your goal is recall-heavy capture. That is exactly why separating Threshold Design and Calibration into different reports was the right move.

. . .

8. What Changed Operationally After Calibration

Calibration gives you a cleaner foundation for:

Risk ranking

When analysts open a queue sorted by fraud probability, calibrated scores are more trustworthy for ordering and prioritization. Your pipeline explicitly frames calibrated outputs as better for ranking/triage.

Probability-based policies

If you later define business rules like:

  • > 0.80 = urgent review
  • 0.55–0.80 = standard review
  • < 0.55 = low-risk queue

those cutoffs make more sense when the score scale has been calibrated.
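Those policy bands translate to a few lines of routing logic. The cutoffs come from the example above, and `triage_bucket` is a hypothetical helper:

```python
import numpy as np

def triage_bucket(probs):
    """Map calibrated probabilities to review queues (cutoffs from the policy above)."""
    return np.select(
        [probs > 0.80, probs >= 0.55],   # checked in order: urgent first
        ["urgent", "standard"],
        default="low-risk",
    )

probs = np.array([0.91, 0.60, 0.30, 0.80])
print(triage_bucket(probs))   # ['urgent' 'standard' 'low-risk' 'standard']
```

This kind of rule is exactly where calibration pays off: the bands only mean what the policy says if 0.80 really behaves like an 80% risk.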

Model communication

Calibration also helps when presenting results to non-ML stakeholders. “Risk score” becomes less mystical and more numerically defensible.

That matters in fraud work, because product and operations teams will ask:

  • “Why is this one higher than that one?”
  • “Can we trust 0.72?”
  • “What happens if we shift the cutoff?”

Calibrated probabilities make those conversations less hand-wavy.

. . .

9. Practical Guidance for Using Calibrated vs Uncalibrated Models in This System

Your project’s design already points to a very practical split:

Keep uncalibrated models for capture-oriented threshold sweeps

These often preserve the strongest recall behavior for aggressive fraud catching when you tune thresholds directly. Your recorded notes explicitly keep them for that use case.

Use calibrated models for risk scoring and triage queues

These are better when the numeric probability itself matters (rank ordering, queue prioritization, probability-based rules, downstream decision contracts). This is a strong systems decision because it avoids forcing one artifact to do two jobs badly.

. . .

Key Takeaway

Calibration is the stage where model scores stop being “just outputs” and start becoming usable risk estimates.

In this fraud pipeline, tree models were calibrated using CalibratedClassifierCV, with Platt scaling for XGBoost and isotonic regression for the other tree ensembles. The project also preserves both calibrated and uncalibrated versions because they support different operational goals: high-recall thresholding vs trustworthy probability-based triage.

The results show a pattern that matters in real deployments:

  • calibration can improve ranking quality and probability trustworthiness,
  • and it can also change recall/precision at the same threshold because the score scale has shifted.

In plain English: the model already knew who looked suspicious; calibration taught it how to speak in more honest numbers.