Purpose

This report defines how model probabilities were converted into actual fraud alerts.

By this stage, the core tree models already produced strong ranking performance. The next step was operational: choosing score cutoffs that determine which claims get flagged for investigation.

That sounds simple. It is not. A threshold is where the math meets the team’s capacity.

A model can score every claim from 0 to 1, but investigators do not review probabilities — they review cases. Threshold design turns scores into a queue.

This report documents how the alerting threshold was selected under a recall-first fraud detection policy, and how threshold tuning was used to explicitly manage the trade-off between:

  • frauds caught (TP)
  • frauds missed (FN)
  • false alarms (FP)
  • review workload (flagged volume)

The central question:

“Which threshold gives the best operational fraud-catching behavior without creating an unmanageable workload?”
. . .

1. Why Threshold Design Is a Separate Stage

The model training stage answers: “Can the model rank risky claims above safer ones?”

Threshold design answers: “At what score do we stop debating and open an investigation?”

Those are different jobs.

A useful metaphor:

  • Model training builds the metal detector.
  • Threshold design sets how sensitive the detector should be.

Set it too low, and it beeps at every bottle cap. Set it too high, and someone walks through with a sword and a suspicious smile.

In this project, threshold design was handled explicitly because fraud detection is a recall-first use case: missing fraud (FN) is usually more expensive than reviewing an extra legitimate claim (FP). This framing is called out directly in the evaluation logic and error analysis.

. . .

2. Operational Evaluation Framing: Mode A (Alerting)

The project evaluation is split into two modes:

  • Mode A — Operational alerting (decision mode): Thresholds are selected to support fraud catching with a recall-heavy objective.
  • Mode B — Triage (ranking mode): Models are compared by ranking quality (PR-AUC) for prioritization.

This report focuses on Mode A, the alerting mode, where thresholds are tuned to maximize recall-weighted performance (F2).

The workflow summary explicitly includes threshold simulation from 0.1 to 0.9 to convert model scores into business outcomes — not just abstract metrics.

“Operational alerting and triage use the same model scores, but they answer different business questions.”
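
The distinction can be sketched in a few lines. Everything below is synthetic and illustrative (the array sizes, seed, and score construction are invented, not project artifacts): Mode B asks a threshold-free ranking question, Mode A asks what a specific cutoff does.

```python
import numpy as np
from sklearn.metrics import average_precision_score, recall_score

rng = np.random.default_rng(7)
y_true = (rng.random(300) < 0.15).astype(int)          # ~15% fraud, synthetic
y_proba = np.clip(0.4 * y_true + rng.random(300) * 0.6, 0, 1)

# Mode B (triage): ranking quality, no threshold involved
pr_auc = average_precision_score(y_true, y_proba)

# Mode A (alerting): a concrete cutoff produces a concrete queue
y_pred = (y_proba >= 0.5).astype(int)
recall = recall_score(y_true, y_pred)
flagged = int(y_pred.sum())
```

Same scores feed both numbers; only Mode A tells you how many cases land on an investigator's desk.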
. . .

3. Threshold Simulation: Turning Scores into Workload

Threshold simulation was performed across a range of cutoffs to observe how behavior changes as the cutoff moves.

The project explicitly tracks threshold effects in operational terms:

  • Recall (how much fraud gets caught)
  • Precision (how many flagged cases are actually fraud)
  • F2 (recall-weighted performance)
  • Flagged volume (how many claims investigators must review)

This is the important bit many toy ML projects skip: flagged volume is included directly, because teams have finite time and finite patience.

Example behavior observed in threshold simulation

For Trees / RandomForest:

  • At threshold 0.4: 66 claims flagged, Recall = 0.8571, F2 = 0.8015
  • At threshold 0.5: 64 claims flagged, Recall = 0.8367, F2 = 0.7885

For Trees / ExtraTrees (similar pattern):

  • Threshold 0.3–0.4: Recall = 0.8571, F2 = 0.8015, 66 claims flagged
  • Threshold 0.5: Recall = 0.8367, F2 = 0.7854

That is the core operational lesson in one page: a tiny threshold move can mean a few extra reviews and a few fewer missed frauds.
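
The reported F2 at threshold 0.4 can be reproduced by hand from the numbers above. Note that FP = 24 is inferred, not quoted: it follows from 66 flagged claims and recall 0.8571, assuming the same 49 total frauds as the operating point reported later.

```python
# Back-of-envelope check of the reported F2 at threshold 0.4.
# Derived from the text: 66 flagged, recall 0.8571, 49 total frauds
# -> TP = 42, FN = 7, FP = 66 - 42 = 24 (FP is inferred, not reported).
tp, fp, fn = 42, 24, 7

precision = tp / (tp + fp)        # 42/66
recall = tp / (tp + fn)           # 42/49
beta = 2                          # F2 weights recall over precision
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(round(recall, 4), round(f2, 4))   # 0.8571 0.8015 — matches the report
```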

# Code Snippet — Threshold Sweep (Conceptual Pattern)
from sklearn.metrics import precision_score, recall_score, fbeta_score

# Assumes y_proba (model scores) and y_test (true labels) are NumPy arrays.
thresholds = [i / 10 for i in range(1, 10)]  # 0.1 through 0.9

rows = []
for t in thresholds:
    y_pred = (y_proba >= t).astype(int)

    tp = int(((y_pred == 1) & (y_test == 1)).sum())
    fp = int(((y_pred == 1) & (y_test == 0)).sum())
    fn = int(((y_pred == 0) & (y_test == 1)).sum())

    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f2 = fbeta_score(y_test, y_pred, beta=2, zero_division=0)

    rows.append({
        "threshold": t,
        "flagged": int(y_pred.sum()),
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "precision": precision,
        "recall": recall,
        "F2": f2,
    })

Why include this snippet: It shows readers that threshold tuning is not magical hand-waving. It is a direct simulation loop over score cutoffs, with operational outputs tracked explicitly.

. . .

4. Chosen Operating Point (Mode A): Recall-First Alerting

After threshold tuning, the strongest tree models converged to nearly the same practical operating point on the tree-optimized dataset.

Headline recall-first operating outcome

  • Frauds caught (TP): 42
  • Frauds missed (FN): 7
  • False alarms (FP): 23
  • Recall: 0.857 (~86% of fraud caught)

In business terms, this operating point catches most fraud in the evaluation slice while generating a manageable (though real) investigation burden. That framing is explicitly documented as a realistic recall-first trade-off profile.

  Model                         Threshold   TP   FP   FN   Recall
  RandomForest (uncalibrated)      0.4895   42   23    7   0.8571
  XGBoost (uncalibrated)           0.5409   42   23    7   0.8571
  ExtraTrees (uncalibrated)        0.4548   41   23    8   0.8367

“Different tree models require different numeric thresholds, but they converge to nearly the same operational outcome.”

This is a great detail for readers, because it teaches something subtle: the exact threshold value depends on the model’s score scale. A threshold of 0.49 for one model and 0.54 for another can represent a very similar alerting policy in practice.
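
The score-scale point can be demonstrated with a small sketch. Two synthetic "models" whose scores are monotone rescales of each other (here, one is simply the square of the other) end up with different F2-optimal thresholds yet flag exactly the same claims. All data, seeds, and the `best_f2_threshold` helper below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import fbeta_score

rng = np.random.default_rng(0)
y_true = (rng.random(500) < 0.1).astype(int)           # ~10% fraud, synthetic
scores_a = np.clip(0.35 * y_true + rng.random(500) * 0.65, 0.0, 1.0)
scores_b = scores_a ** 2    # monotone rescale: same ranking, different scale

def best_f2_threshold(y, p):
    # Candidate cutoffs = the observed score values themselves,
    # so every distinct flagged set is considered.
    cands = np.unique(p)
    f2s = [fbeta_score(y, (p >= t).astype(int), beta=2, zero_division=0)
           for t in cands]
    return cands[int(np.argmax(f2s))]

ta = best_f2_threshold(y_true, scores_a)
tb = best_f2_threshold(y_true, scores_b)

# Numeric thresholds differ, but the flagged claims are identical:
same = bool(np.array_equal(scores_a >= ta, scores_b >= tb))
```

This mirrors the table above: 0.4895 versus 0.5409 versus 0.4548 are different numbers describing nearly the same alerting policy.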

. . .

5. Why Models Converged to Similar Alerting Outcomes

The report notes a very practical pattern: RandomForest, ExtraTrees, and XGBoost learned very similar high-risk rankings. Once thresholds were tuned for the same objective (F2), they flagged nearly the same claims.

That is an excellent AI PM takeaway.

At that point, model selection moves beyond “who catches more fraud?” and shifts toward decisions like:

  • stability
  • score quality
  • speed
  • explainability fit
  • operational integration

The project explicitly frames this as a practical decision once operational behavior converges.

“Threshold tuning controls both fraud capture and investigation load. The curve is not just model performance — it is staffing policy in graph form.”
. . .

6. Error Cost Framing Behind Threshold Choice

Threshold design here is guided by explicit error-cost reasoning:

  • False Negatives (FN) = missed fraud. Highest-cost failure in a recall-first fraud filter.
  • False Positives (FP) = legitimate claims flagged. Investigation workload and friction, but usually preferable to missed fraud.

This error framing directly shaped the modeling and threshold policy:

  • optimize for F2 (recall-weighted)
  • simulate thresholds instead of blindly keeping 0.5
  • carry forward context features that help reduce missed fraud in practice

That is the product-management layer of the work. The threshold is not a technical default; it is a business control.
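
The error-cost reasoning can be made explicit with hypothetical cost weights. The 10:1 ratio below is an assumption for illustration only; the counts come from the RandomForest sweep reported earlier (TP/FP at each threshold derived from the flagged totals and recalls, assuming 49 total frauds).

```python
# Hypothetical cost weights — illustrative, not from the project:
C_FN, C_FP = 10.0, 1.0    # missing a fraud costs 10x a needless review

def expected_cost(fp, fn):
    return C_FN * fn + C_FP * fp

# Derived counts: t=0.4 -> TP 42, FP 24, FN 7; t=0.5 -> TP 41, FP 23, FN 8
cost_04 = expected_cost(fp=24, fn=7)    # 10*7 + 24 = 94
cost_05 = expected_cost(fp=23, fn=8)    # 10*8 + 23 = 103
```

Under any weighting where a miss is much more expensive than a review, the lower threshold wins, which is exactly the recall-first policy the project adopted.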

“Confusion matrix at the selected recall-first operating point. This is the practical cost profile the team would be accepting.”
. . .

7. What This Report Contributes to the Overall System

Threshold design is where the fraud model starts behaving like a decision system instead of a leaderboard entry.

It adds three things that matter in production conversations:

  1. Policy control: You can explain why the threshold is set where it is.
  2. Capacity awareness: You can estimate investigation load, not only model quality.
  3. Auditability: You can revisit the threshold later when business constraints change.

This also prepares the ground for the next report (Calibration), because once probabilities are recalibrated, the same numeric threshold may no longer represent the same decision boundary. That behavior is already visible in the observed recall shifts at threshold 0.5 after calibration.
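
A minimal sketch of why, assuming a Platt-style monotone recalibration with invented parameters: the recalibrated scores preserve the ranking of every claim, yet a fixed cutoff of 0.5 selects a different queue.

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.random(1000)    # stand-in model scores, synthetic

# Monotone sigmoid recalibration (parameters chosen for illustration):
calibrated = 1 / (1 + np.exp(-(4.0 * raw - 3.0)))

flagged_raw = int((raw >= 0.5).sum())
flagged_cal = int((calibrated >= 0.5).sum())
# Same ranking, same claims — but 0.5 on the calibrated scale is
# equivalent to 0.75 on the raw scale, so the queue shrinks sharply.
```

This is why the threshold must be re-tuned, not carried over, once calibration is applied.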

. . .

Key Takeaway

Threshold design converts model scores into operational behavior.

In this project, thresholds were tuned explicitly under a recall-first policy, and the main tree models converged to a similar alerting outcome:

  • 42 frauds caught
  • 7 frauds missed
  • 23 false alarms
  • ~86% recall

That outcome is strong enough to be useful and honest enough to be operationally discussed. It also makes one thing very clear:

The threshold is not a decorative setting. It is the dial that controls what the fraud team actually experiences day to day.