Purpose
This report documents the probability calibration stage for the tuned tree models.
By this point, the models were already good at separating suspicious claims from normal ones. The remaining issue was more subtle: the score values themselves (0.2, 0.6, 0.9, etc.) are often not “honest probabilities” out of the box for tree ensembles.
In fraud work, that matters a lot.
A score is used for more than a yes/no decision. It is also used for:
- ranking which claims analysts review first
- setting triage cutoffs by team capacity
- comparing risk across batches
- generating explainable risk narratives
If the score scale is distorted, the queue becomes messy even when the classifier still looks good on recall.
This stage calibrates the tree models so the probability scale becomes more reliable for decision support. The project also keeps both versions (uncalibrated and calibrated), because they serve different operational purposes.
The central question: if the models already rank claims well, how do we make the score values themselves trustworthy enough to act on?
1. Why Calibration Exists at All
Tree models (RandomForest, XGBoost, ExtraTrees, AdaBoost, Bagging) often produce strong ranking performance, but their raw probability outputs can be “overconfident” or “underconfident.”
Simple version:
- the model may rank fraud cases correctly,
- but the number 0.80 does not always mean “about 80% fraud likelihood.”
That is exactly the problem calibration solves: it reshapes the score scale so probability values behave more consistently for ranking and triage decisions. This calibration step was explicitly introduced for that reason in your pipeline.
Plain-language metaphor
Imagine a bathroom scale that always puts heavier people above lighter people, so the ordering is fine — but it adds +7 kg to everyone. It still sorts people correctly, but the numbers are wrong.
Calibration fixes the numbers, not just the order.
That distinction is gold in fraud systems.
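The metaphor can be made concrete with a tiny sketch: a monotone distortion of the scores leaves ranking quality (ROC-AUC) untouched, but degrades probability quality (Brier score). The labels and probabilities below are hypothetical toy values for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Toy labels and "honest" probabilities (hypothetical values)
y = np.array([0, 0, 0, 1, 1, 1])
p_true = np.array([0.10, 0.20, 0.30, 0.60, 0.80, 0.90])

# A monotone distortion: ordering is preserved, the numbers are not
p_distorted = p_true ** 3

# Ranking quality is unchanged by any monotone transform...
auc_true = roc_auc_score(y, p_true)
auc_dist = roc_auc_score(y, p_distorted)

# ...but probability quality (Brier score: lower is better) gets worse
brier_true = brier_score_loss(y, p_true)
brier_dist = brier_score_loss(y, p_distorted)
```

Both versions sort every fraud case above every normal case, so the AUC is identical; only the distorted version pays a Brier-score penalty for its dishonest numbers.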
2. Why Calibration Matters Specifically for Fraud
Fraud detection is rarely “predict and forget.” It behaves like a triage queue:
- Claims are sorted by risk score
- A review team has limited capacity
- A threshold is set depending on workload and tolerance for misses
That means two separate needs exist at the same time:
A) Capture-oriented mode (recall-heavy)
Useful when you want aggressive fraud capture and are willing to review more claims.
B) Probability-oriented mode (risk scoring / ranking)
Useful when the team depends on the score itself for prioritization, routing, or consistent downstream logic.
Your project explicitly keeps both uncalibrated and calibrated models because those two modes are real, not theoretical. Uncalibrated models were kept for aggressive threshold tuning, while calibrated versions were kept for trustworthy risk scores and probability-based decisions.
That is a very sane system design choice. Many people mash these together and wonder why operations get weird later.
3. Calibration Methods Used in This Project
Your pipeline used CalibratedClassifierCV for tree model calibration, with two calibration
strategies depending on the model family:
- XGBoost → Platt scaling (sigmoid)
- Other tree models → Isotonic regression
What these methods do (simple explanation)
Platt scaling (sigmoid)
Platt scaling fits a smooth S-shaped curve on top of the model’s raw scores. It is a good choice when:
- score distortion is smooth and monotonic
- you want a stable, compact mapping
- you want less risk of overfitting calibration on smaller data
In your pipeline, XGBoost used this approach.
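As a rough sketch of what Platt scaling does internally: it fits a one-dimensional logistic regression on held-out raw scores, so the calibrated probability is sigmoid(A * score + B). The `raw_scores` and `y_holdout` arrays below are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out raw scores and labels
raw_scores = np.array([0.05, 0.20, 0.35, 0.55, 0.70, 0.90]).reshape(-1, 1)
y_holdout = np.array([0, 0, 1, 0, 1, 1])

# Platt scaling = a 1-D logistic regression on top of the raw scores
platt = LogisticRegression()
platt.fit(raw_scores, y_holdout)

# Calibrated probability = sigmoid(A * raw_score + B)
calibrated = platt.predict_proba(raw_scores)[:, 1]
```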
Isotonic regression
Isotonic regression fits a monotonic step-like mapping from raw scores to calibrated probabilities. It is more
flexible than Platt scaling, because it does not force an S-shape. It is useful when:
- calibration error is not well-described by one smooth curve
- you want a more data-driven mapping
- the score behavior is irregular (common with tree ensembles)
Your other tree models used isotonic regression.
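A minimal sketch of the isotonic idea, using sklearn's standalone `IsotonicRegression` on hypothetical held-out scores: the fitted mapping is a monotonic step function from raw score to probability, pooling neighboring scores where the labels disagree with the ordering.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical held-out raw scores and labels
raw_scores = np.array([0.05, 0.20, 0.35, 0.55, 0.70, 0.90])
y_holdout = np.array([0, 0, 1, 0, 1, 1])

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores, y_holdout)

# A monotonic step function: the local label conflict at scores
# 0.35 and 0.55 gets pooled into a single plateau
calibrated = iso.predict(raw_scores)
```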
Why this split makes sense in your setup
This was a practical choice for the model families you trained:
- XGBoost often benefits from a smoother probability remapping (Platt/sigmoid)
- Bagged/ensemble trees can have more jagged score distributions, where isotonic can adapt better
That is consistent with the calibration design in your pipeline. The key point is that calibration was treated as a probability-quality step, not as a retraining trick.
Code Snippet — Applying Calibration

```python
from sklearn.calibration import CalibratedClassifierCV

# XGBoost -> Platt scaling (sigmoid)
xgb_calibrated = CalibratedClassifierCV(
    xgb_best_model,
    method="sigmoid",
    cv=5,
)

# Other tree models -> isotonic regression
rf_calibrated = CalibratedClassifierCV(
    rf_best_model,
    method="isotonic",
    cv=5,
)

xgb_calibrated.fit(X_train, y_train)
rf_calibrated.fit(X_train, y_train)
```
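For a self-contained illustration of how the calibrated and raw versions can be compared afterwards, here is a sketch on synthetic data (a stand-in for the real fraud features), using the Brier score as a direct measure of probability quality:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data (the real pipeline uses fraud features)
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

raw_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
cal_model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=5,
).fit(X_tr, y_tr)

# Brier score (lower is better) scores the probability values themselves
brier_raw = brier_score_loss(y_te, raw_model.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, cal_model.predict_proba(X_te)[:, 1])
```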
4. Alternatives You Could Use
For readers who haven’t touched calibration before, here are the common alternatives:
- Temperature scaling: Often used in deep learning. It adjusts confidence with one scalar “temperature.” Very clean for neural nets, less common as the main choice for mixed tree ensembles.
- Histogram binning / quantile binning: Groups scores into bins and recalculates empirical fraud rates per bin. Easy to explain, but can become coarse and unstable depending on sample size.
- Beta calibration: A more flexible parametric calibration method than Platt in some settings. Useful, but less standard in many scikit-style pipelines.
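To make the binning alternative tangible, here is a minimal sketch of histogram binning on hypothetical scores and labels: scores are bucketed into equal-width bins, and each bin's empirical fraud rate becomes the calibrated probability.

```python
import numpy as np

def histogram_binning(scores, labels, n_bins=5):
    # Equal-width bins over [0, 1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    # Empirical positive rate per bin (0.0 for empty bins)
    bin_rate = np.array([
        labels[idx == b].mean() if np.any(idx == b) else 0.0
        for b in range(n_bins)
    ])
    return bin_rate[idx]

# Hypothetical raw scores and labels
scores = np.array([0.05, 0.15, 0.42, 0.48, 0.81, 0.95])
labels = np.array([0, 0, 1, 0, 1, 1])
calibrated = histogram_binning(scores, labels)
```

The coarseness problem mentioned above is visible here: with few samples per bin, a single label flips a bin's rate substantially.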
Why your setup is strong for this project
Using CalibratedClassifierCV with Platt + isotonic is a very practical and defensible choice for
classical ML fraud pipelines: widely used, integrates cleanly with sklearn workflows, easy to reproduce, and
provides clear separation between model fitting and probability correction.
And crucially, your results show why calibration must be evaluated with context: calibration improves probability quality, but it can move recall at a fixed threshold because the score scale changes. That behavior appears directly in your metrics.
5. What Calibration Changes in Practice
Calibration does not change the underlying fraud pattern learned by the model.
It changes how score values map to risk.
That means the same threshold (for example, 0.5) can behave differently before and after calibration, because 0.5 now means something different on the recalibrated scale.
Your project notes exactly this effect in the tree models: recall often drops at the default 0.5 threshold after calibration, because the probability scale has been reshaped. This is expected behavior, and it is one reason threshold selection remains an operational choice after calibration.
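That effect can be reproduced in a few lines: a monotone remapping keeps the ranking identical but moves cases across a fixed 0.5 cutoff. The scores and labels below are hypothetical, and squaring stands in for a calibration that compresses the scale downward.

```python
import numpy as np

# Hypothetical labels and raw scores
y = np.array([0, 0, 1, 0, 1, 1, 1])
raw = np.array([0.20, 0.35, 0.55, 0.45, 0.60, 0.75, 0.90])

# A calibration-like monotone remap: same ordering, compressed scale
remap = raw ** 2

def recall_at(scores, y, thr=0.5):
    flagged = scores >= thr
    return flagged[y == 1].mean()

# The fixed 0.5 cutoff now lands in a different part of the distribution
r_raw = recall_at(raw, y)
r_cal = recall_at(remap, y)
```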
Simple metaphor
Think of calibration like changing from Fahrenheit to Celsius.
The weather did not change. The numbers changed. If you keep the same cutoff number after changing the scale, you will make weird decisions.
That is why “calibrate first, then re-check thresholds” is the grown-up move.
6. Calibration Results (Original Feature Representation)
Below is a clean summary of the calibration comparison at the default threshold (0.5) for the Original dataset representation.
| Model | Version | Precision | Recall | F1 | PR-AUC | ROC-AUC |
|---|---|---|---|---|---|---|
| RandomForest | Uncalibrated | 0.6207 | 0.7347 | 0.6729 | 0.5356 | 0.7883 |
| RandomForest | Calibrated | 0.6275 | 0.6531 | 0.6400 | 0.5510 | 0.7940 |
| XGBoost | Uncalibrated | 0.6087 | 0.5714 | 0.5895 | 0.4545 | 0.7546 |
| XGBoost | Calibrated | 0.6078 | 0.6327 | 0.6200 | 0.5447 | 0.8077 |
| ExtraTrees | Uncalibrated | 0.6207 | 0.7347 | 0.6729 | 0.5180 | 0.7804 |
| ExtraTrees | Calibrated | 0.6222 | 0.5714 | 0.5957 | 0.4860 | 0.7725 |
What stands out here
XGBoost benefited a lot after calibration on the Original dataset. PR-AUC improved noticeably (0.4545 → 0.5447), ROC-AUC improved by over 5 points (0.7546 → 0.8077), and recall improved at threshold 0.5 (0.5714 → 0.6327). That is a strong example of calibration improving score quality and also helping operational behavior at the default threshold.
7. Calibration Results (Tree-Optimized Feature Representation)
You also evaluated calibrated vs uncalibrated models on the Trees dataset representation.
| Model | Version | Precision | Recall | F1 | PR-AUC | ROC-AUC |
|---|---|---|---|---|---|---|
| XGBoost | Uncalibrated | 0.6364 | 0.8571 | 0.7304 | 0.5583 | 0.8214 |
| XGBoost | Calibrated | 0.6129 | 0.7755 | 0.6858 | 0.5227 | 0.8014 |
| ExtraTrees | Uncalibrated | 0.6406 | 0.8367 | 0.7257 | 0.5448 | 0.8134 |
| ExtraTrees | Calibrated | 0.6034 | 0.7143 | 0.6542 | 0.5137 | 0.8019 |
| VotingEnsemble | Uncalibrated | 0.6269 | 0.8571 | 0.7242 | 0.5538 | 0.8192 |
| VotingEnsemble | Calibrated | 0.6071 | 0.6939 | 0.6476 | 0.5421 | 0.8154 |
What stands out here
For the strongest tree models, calibration reduced recall at the default threshold (0.5). Your own summary correctly explains why: calibration changed the probability scale, so 0.5 now cuts the score distribution differently.
Example: Trees / XGBoost recall drops from 0.8571 → 0.7755 after calibration at threshold 0.5.
This does not mean the calibrated model is “worse” in a general sense. It means the threshold now needs to be re-tuned on the new score scale if your goal is recall-heavy capture. That is exactly why separating Threshold Design and Calibration into different reports was the right move.
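Re-tuning on the new scale can be as simple as sweeping candidate cutoffs until a target recall is met. A sketch, with hypothetical calibrated scores and labels standing in for the real test-set outputs:

```python
import numpy as np

def threshold_for_recall(scores, y, target_recall=0.85):
    # Walk candidate cutoffs from highest positive score downward
    # and return the first one that achieves the target recall
    for thr in np.sort(np.unique(scores[y == 1]))[::-1]:
        recall = (scores[y == 1] >= thr).mean()
        if recall >= target_recall:
            return thr
    return scores.min()

# Hypothetical calibrated scores and labels
y = np.array([0, 0, 1, 0, 1, 1, 1])
cal_scores = np.array([0.10, 0.22, 0.31, 0.28, 0.36, 0.57, 0.83])
thr = threshold_for_recall(cal_scores, y, target_recall=0.75)
```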
8. What Changed Operationally After Calibration
Calibration gives you a cleaner foundation for:
Risk ranking
When analysts open a queue sorted by fraud probability, calibrated scores are more trustworthy for ordering and prioritization. Your pipeline explicitly frames calibrated outputs as better for ranking/triage.
Probability-based policies
If you later define business rules like:
- > 0.80 = urgent review
- 0.55–0.80 = standard review
- < 0.55 = low-risk queue
those cutoffs make more sense when the score scale has been calibrated.
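Once the scale is trustworthy, such a policy is a one-function routing rule. A sketch using the illustrative cutoffs above (the function name and queue labels are hypothetical):

```python
def route_claim(p_fraud: float) -> str:
    """Map a calibrated fraud probability to a review queue,
    using the illustrative cutoffs from the policy above."""
    if p_fraud > 0.80:
        return "urgent_review"
    if p_fraud >= 0.55:
        return "standard_review"
    return "low_risk"
```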
Model communication
Calibration also helps when presenting results to non-ML stakeholders. “Risk score” becomes less mystical and more numerically defensible.
That matters in fraud work, because product and operations teams will ask:
- “Why is this one higher than that one?”
- “Can we trust 0.72?”
- “What happens if we shift the cutoff?”
Calibrated probabilities make those conversations less hand-wavy.
9. Practical Guidance for Using Calibrated vs Uncalibrated Models in This System
Your project’s design already points to a very practical split:
Keep uncalibrated models for capture-oriented threshold sweeps
These often preserve the strongest recall behavior for aggressive fraud catching when you tune thresholds directly. Your recorded notes explicitly keep them for that use case.
Use calibrated models for risk scoring and triage queues
These are better when the numeric probability itself matters (rank ordering, queue prioritization, probability-based rules, downstream decision contracts). This is a strong systems decision because it avoids forcing one artifact to do two jobs badly.
Key Takeaway
Calibration is the stage where model scores stop being “just outputs” and start becoming usable risk estimates.
In this fraud pipeline, tree models were calibrated using CalibratedClassifierCV, with
Platt scaling for XGBoost and isotonic regression for the other tree
ensembles. The project also preserves both calibrated and uncalibrated versions because they
support different operational goals: high-recall thresholding vs trustworthy probability-based triage.
The results show a pattern that matters in real deployments:
- calibration can improve ranking quality and probability trustworthiness,
- and it can also change recall/precision at the same threshold because the score scale has shifted.
In plain English: the model already knew who looked suspicious; calibration taught it how to speak in more honest numbers.