Purpose

This report documents the bias and fairness audit of the fraud detection system following model calibration and SHAP-based interpretability analysis.

Fraud detection models differ from allocation systems such as credit scoring or hiring models. They do not distribute resources. Instead, they prioritize claims for investigation. The fairness question therefore shifts from “who receives” to “who is disproportionately scrutinized.”

The model was tuned with an F2 objective, prioritizing recall in order to reduce missed fraud cases. While this improves financial risk coverage, it increases the number of flagged claims. Fairness evaluation must therefore focus on how investigative burden and error rates are distributed across segments.

This report evaluates:

  • Proxy feature stability
  • Segment-level false positive and false negative rates
  • Concentration of flagged claims
  • Threshold effects on burden distribution
  • Monitoring implications for production deployment
. . .

1. Fairness Framing in Fraud Systems

In credit scoring, fairness is often defined through approval parity or equalized odds. In fraud detection, the outcome is suspicion rather than approval. A false positive does not deny a resource but triggers additional scrutiny. However, if specific groups are consistently flagged at higher rates without clear operational justification, the system can create disproportionate investigative pressure.

Fraud models are particularly sensitive to this issue because:

  • They operate in rare-event settings
  • They amplify outliers
  • They frequently rely on high-cardinality categorical features

This makes proxy risk and error dispersion especially important to examine.

. . .

2. Proxy Feature Assessment

insured_hobbies as a Stability Risk

SHAP analysis identified insured_hobbies as a high-impact feature in global importance rankings.

The presence of predictive signal alone is insufficient justification for production use. Features must also demonstrate operational plausibility. In this case, hobbies lack a clear causal relationship to fraud likelihood. Their predictive contribution may reflect indirect correlations with age, socioeconomic background, or dataset construction artifacts.

To assess dependency on this feature, a removal simulation was conducted.

Feature Removal Simulation

Model Variant      PR-AUC   Recall   FPR
Full Feature Set   0.41     0.82     0.23
Without Hobbies    0.39     0.79     0.20

Removing hobbies reduces performance slightly but also lowers false positive concentration across certain occupation segments.

This suggests that while hobbies contribute signal, they may also amplify segment-specific risk patterns. In production settings, such trade-offs must be explicitly evaluated rather than implicitly accepted.
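The removal simulation above can be sketched as a simple ablation study. This is a minimal illustration, assuming a scikit-learn workflow; the dataset is synthetic, and the model choice, column index for the dropped feature, and 0.5 flagging threshold are all placeholders rather than the audited system's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate(X_train, X_test, y_train, y_test, drop_cols=()):
    """Fit on all columns except drop_cols; return (PR-AUC, recall, FPR)."""
    keep = [i for i in range(X_train.shape[1]) if i not in drop_cols]
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:, keep], y_train)
    proba = model.predict_proba(X_test[:, keep])[:, 1]
    pred = (proba >= 0.5).astype(int)  # illustrative flagging threshold
    pr_auc = average_precision_score(y_test, proba)
    recall = recall_score(y_test, pred)
    fp = np.sum((pred == 1) & (y_test == 0))
    tn = np.sum((pred == 0) & (y_test == 0))
    return pr_auc, recall, fp / (fp + tn)

# Synthetic stand-in for the claims data (rare-event class balance).
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

full = evaluate(X_tr, X_te, y_tr, y_te)
ablated = evaluate(X_tr, X_te, y_tr, y_te, drop_cols={0})  # hypothetical hobbies column
print(f"full:    PR-AUC={full[0]:.2f} recall={full[1]:.2f} FPR={full[2]:.2f}")
print(f"ablated: PR-AUC={ablated[0]:.2f} recall={ablated[1]:.2f} FPR={ablated[2]:.2f}")
```

Comparing the two rows of output mirrors the table above: the ablated variant trades a small amount of ranking quality for a different error profile.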

. . .

3. Segment-Level Error Distribution

Fairness in fraud detection is best assessed through comparison of error rates across segments.

The following metrics were evaluated:

  • Flag Rate (percentage of claims flagged)
  • False Positive Rate (FPR)
  • False Negative Rate (FNR)

Illustrative Segment Distribution

Segment          Flag Rate   FPR    FNR
Age < 25         18%         0.12   0.19
Age 50+          24%         0.18   0.10
Region A         30%         0.22   0.09
Hobby Cluster X  35%         0.25   0.08

Even when global metrics appear stable, disproportionate concentration of false positives can indicate structural imbalance.

A visual representation clarifies this effect:

False Positive Rate by Segment

Age < 25         ████████████ 0.12
Age 50+          ██████████████████ 0.18
Region A         ███████████████████████ 0.22
Hobby Cluster X  ███████████████████████████ 0.25

Higher FPR in specific segments increases investigation burden for those groups. Whether this reflects true risk differences or proxy amplification must be carefully evaluated.
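The three metrics above can be computed per segment with a short helper. This is a sketch, assuming labeled outcomes and binary model flags are available in a pandas DataFrame; the column names (`is_fraud`, `flagged`, `segment`) and the random data are illustrative, not the production schema.

```python
import numpy as np
import pandas as pd

def segment_error_rates(df, segment_col, y_col="is_fraud", flag_col="flagged"):
    """Per-segment flag rate, FPR, and FNR from outcomes and model flags."""
    rows = []
    for seg, g in df.groupby(segment_col):
        y, f = g[y_col].to_numpy(), g[flag_col].to_numpy()
        fp = np.sum((f == 1) & (y == 0))
        tn = np.sum((f == 0) & (y == 0))
        fn = np.sum((f == 0) & (y == 1))
        tp = np.sum((f == 1) & (y == 1))
        rows.append({
            segment_col: seg,
            "flag_rate": f.mean(),
            "fpr": fp / (fp + tn) if (fp + tn) else np.nan,
            "fnr": fn / (fn + tp) if (fn + tp) else np.nan,
        })
    return pd.DataFrame(rows)

# Illustrative random data standing in for scored claims.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.choice(["A", "B"], size=1000),
    "is_fraud": rng.binomial(1, 0.1, size=1000),
    "flagged": rng.binomial(1, 0.25, size=1000),
})
print(segment_error_rates(df, "segment"))
```

A table like this, recomputed on each evaluation window, is the raw input for the segment-level dashboards recommended in section 6.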

. . .

4. Threshold Effects on Burden Distribution

The model is optimized for recall. Lowering the classification threshold increases sensitivity and reduces false negatives. However, this also expands the number of flagged claims.

Conceptually:

Higher threshold → fewer claims flagged → lower FPR, higher FNR
Lower threshold  → more claims flagged  → higher FPR, lower FNR

If baseline fraud probabilities differ slightly across segments, threshold adjustments can amplify those differences.

For example, if one segment’s average predicted probability is marginally higher due to correlated features, lowering the threshold increases its flagging rate disproportionately. This makes threshold selection both a performance decision and a fairness control parameter.
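This effect can be demonstrated with a threshold sweep over segmented scores. The sketch below uses synthetic score distributions in which one segment is shifted slightly higher, standing in for the correlated-feature effect described above; the distributions and thresholds are illustrative assumptions.

```python
import numpy as np

def flag_rates_by_threshold(scores_by_segment, thresholds):
    """For each threshold, the fraction of claims flagged in each segment."""
    return {
        t: {seg: float(np.mean(s >= t)) for seg, s in scores_by_segment.items()}
        for t in thresholds
    }

rng = np.random.default_rng(0)
# Segment B's predicted probabilities are marginally higher than segment A's.
scores = {"A": rng.beta(2.0, 8.0, 5000), "B": rng.beta(2.4, 8.0, 5000)}

rates = flag_rates_by_threshold(scores, thresholds=[0.5, 0.3, 0.1])
for t, r in rates.items():
    print(f"threshold={t}: A={r['A']:.3f}  B={r['B']:.3f}  gap={r['B'] - r['A']:.3f}")
```

Tracking the per-segment gap across candidate thresholds makes the fairness cost of each recall target explicit before a threshold is fixed.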

. . .

5. Temporal Stability Considerations

Fraud patterns evolve over time. Regional trends, claim types, and reporting behaviors can shift.

A fairness evaluation performed at a single time point may not remain valid after distributional changes. Segment-level error rates should therefore be monitored longitudinally.

Without ongoing monitoring, a model that appears balanced at deployment can drift toward disproportionate burden allocation.
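One way to operationalize longitudinal monitoring is a windowed comparison against the deployment-time baseline. This is a minimal sketch under assumed inputs: the baseline FPRs, the 0.05 tolerance, and the window structure are hypothetical parameters, not figures from this audit.

```python
import numpy as np

def fpr(y_true, flagged):
    """False positive rate among non-fraud claims."""
    fp = np.sum((flagged == 1) & (y_true == 0))
    tn = np.sum((flagged == 0) & (y_true == 0))
    return fp / (fp + tn) if (fp + tn) else float("nan")

def drift_alerts(baseline, windows, tolerance=0.05):
    """Return (window, segment, fpr) triples where a window's segment FPR
    deviates from the deployment-time baseline by more than tolerance."""
    alerts = []
    for window_id, segments in windows.items():
        for seg, (y, f) in segments.items():
            current = fpr(y, f)
            if abs(current - baseline[seg]) > tolerance:
                alerts.append((window_id, seg, round(float(current), 3)))
    return alerts

# Illustrative window: all claims here are legitimate (y = 0), so FPR = flag rate.
zeros = np.zeros(100, dtype=int)
flags_a = np.array([1] * 12 + [0] * 88)  # 0.12: within tolerance of baseline
flags_b = np.array([1] * 30 + [0] * 70)  # 0.30: drifted well past baseline
baseline = {"A": 0.10, "B": 0.10}
windows = {"window_1": {"A": (zeros, flags_a), "B": (zeros, flags_b)}}
print(drift_alerts(baseline, windows))  # → [('window_1', 'B', 0.3)]
```

An alert here would trigger the fairness re-evaluation described in section 6 rather than any automatic model change.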

. . .

6. Production Monitoring Requirements

Before production deployment, the following safeguards are recommended:

  • Validation of high-impact proxy features
  • Segment-level error dashboards
  • Regular threshold review
  • Drift-triggered fairness re-evaluation

These mechanisms ensure that performance optimization does not unintentionally lead to sustained imbalance.

. . .

Conclusion

The fraud detection system demonstrates strong predictive capability and calibrated probability outputs. However, fairness analysis highlights:

  • Sensitivity to proxy features
  • Error concentration risk under recall-heavy tuning
  • Dependence on threshold selection
  • Need for temporal monitoring

These findings do not invalidate the model. They clarify the operational conditions required for responsible deployment.