Purpose

This report documents the linear-model tuning stage of the fraud detection pipeline.

At this point, the data had already been cleaned and reshaped for linear modeling (including step-like binary flags, tenure buckets, and missingness indicators). The work in this stage focused on two practical questions:

  • How far can a linear model go on this problem after feature engineering?
  • Which probability threshold gives a workable fraud-detection tradeoff?

The result is a tuned Logistic Regression baseline that is useful for transparency, comparison, and sanity checks in later stages.

. . .

1. Why a Linear Model Was Tuned at All

Linear models are valuable in fraud work because they are easier to inspect and explain than more complex models. They give you a clean reference point: if a more advanced model performs better, you can show exactly what extra complexity bought you.

There is also a practical reason. In regulated or operational settings, teams often want a baseline that behaves predictably and can be explained to non-technical stakeholders. Logistic Regression is the classic candidate for that job.

In plain language: a linear model draws one smooth rule across the data. That makes it easier to understand, but it also means it can miss “pockets” of fraud behavior that only appear in certain combinations of features. That limitation becomes important later in the project, and this report helps make that visible with numbers instead of vibes.

. . .

2. Inputs to the Linear Tuning Stage

The tuning was done on an engineered dataset prepared specifically for linear models. That preparation mattered a lot, because linear models usually need the feature space to be “flattened” into something they can use.

Examples of engineered inputs carried into this stage include:

  • a binary flag for severe incidents (incident_severity_is_major)
  • buckets for policy tenure (days_since_bind)
  • explicit missingness indicators (such as collision type missing)
  • categorical handling designed for linear models

This is the modeling equivalent of translating dialects before a meeting. Everyone is still talking about the same claim, but now the model can actually follow the conversation.
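As an illustration, the kind of flattening described above might look like this in pandas (the column names, bucket edges, and sample values here are representative stand-ins, not the pipeline's exact ones):

```python
import pandas as pd

# Hypothetical raw claim rows; the real pipeline's columns may differ
claims = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Major Damage"],
    "days_since_bind": [12, 410, 95],
    "collision_type": ["Rear Collision", None, "Side Collision"],
})

# Step-like binary flag for severe incidents
claims["incident_severity_is_major"] = (
    claims["incident_severity"] == "Major Damage"
).astype(int)

# Tenure buckets, so the linear model can learn one weight per range
claims["tenure_bucket"] = pd.cut(
    claims["days_since_bind"],
    bins=[0, 30, 180, 10_000],
    labels=["new", "established", "long_term"],
)

# Explicit missingness indicator instead of silently dropping the row
claims["collision_type_missing"] = claims["collision_type"].isna().astype(int)

# One-hot encode the buckets so each level gets its own coefficient
engineered = pd.get_dummies(claims, columns=["tenure_bucket"], drop_first=True)
```

Each transformation turns a raw field into something a single linear weight can act on, which is exactly the "translation" the analogy below describes.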

[Figure: concept diagram of engineered inputs feeding a logistic regression]
“Linear tuning was performed on a feature set explicitly shaped for linear decision-making.”
. . .

3. Model Chosen for Tuning: Logistic Regression

The tuned linear model in this stage is Logistic Regression.

What Logistic Regression does (simple version)

It estimates the probability of fraud by combining feature signals into a weighted score, then converts that score into a value between 0 and 1.

Think of it like a scoreboard:

  • “Major damage” might add points
  • “Missing collision type” might add points
  • Certain tenure buckets might add or subtract points

At the end, the model says something like: “This claim looks 0.72 likely to be fraud.” The threshold decision (for example, 0.35 vs 0.50) turns that probability into an action.

That probability-first behavior is why Logistic Regression is still useful in serious systems. It gives you a ranked signal before you choose an operating policy.
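The scoreboard intuition can be made concrete with a toy calculation. The weights below are invented purely for illustration; they are not the model's actual coefficients:

```python
import math

# Hypothetical learned weights for the "scoreboard" intuition
weights = {
    "incident_severity_is_major": 1.4,   # major damage adds points
    "collision_type_missing": 0.9,       # a missing field adds points
    "tenure_is_new": 0.6,                # short tenure adds points
}
bias = -2.0  # baseline score when no risk signals fire

# One claim with major damage and a missing collision type
claim = {
    "incident_severity_is_major": 1,
    "collision_type_missing": 1,
    "tenure_is_new": 0,
}

# Weighted score, then squashed into a 0-1 probability by the sigmoid
score = bias + sum(weights[f] * v for f, v in claim.items())
probability = 1 / (1 + math.exp(-score))
print(round(probability, 2))
```

The threshold decision then acts on `probability`, not on the raw score.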

// Code Snippet — Logistic Regression Setup (Representative)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

log_reg = LogisticRegression()

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"]
}

grid = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1
)

grid.fit(X_train, y_train)

Why this step exists

  • GridSearchCV tries multiple parameter combinations
  • scoring="roc_auc" compares how well the model ranks fraud vs non-fraud across thresholds
  • class_weight="balanced" helps the model pay more attention to the minority fraud class
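A self-contained sketch of how the fitted grid is then inspected and scored. It runs on a synthetic stand-in dataset, not the project's data, so the printed numbers are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Tiny imbalanced synthetic stand-in for the engineered fraud dataset
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]},
    scoring="roc_auc",
    cv=5,
).fit(X_train, y_train)

print(grid.best_params_)            # the winning parameter combination
print(round(grid.best_score_, 4))   # best mean cross-validated AUC

# GridSearchCV refits the best estimator on all of X_train;
# predict_proba therefore uses that tuned model on held-out data
test_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print(round(test_auc, 4))
```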
. . .

4. Hyperparameter Search Results

A grid search was run for Logistic Regression using AUC as the tuning objective.

  • Best CV AUC: 0.7462
  • Test AUC (Held-out): 0.8202

Best parameter set

  • C = 0.1
  • class_weight = 'balanced'
  • penalty = 'l2'
  • solver = 'liblinear'

What these parameters mean in human language

  • C=0.1 → stronger regularization (the model is kept on a shorter leash, which reduces overfitting)
  • class_weight='balanced' → fraud cases get more weight during training because they are fewer
  • l2 penalty → shrinks coefficients smoothly rather than letting them grow wildly
  • liblinear solver → a stable solver commonly used for smaller/medium tabular classification problems

This combination is a pretty sensible “grown-up baseline”: not too loose, not too rigid, and explicitly adjusted for imbalance.
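For the imbalance adjustment specifically, scikit-learn's "balanced" mode uses the documented heuristic n_samples / (n_classes × count_per_class). A quick illustration with invented class counts (not the project's actual training counts):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative counts: 151 legitimate claims, 49 fraud claims
y = np.array([0] * 151 + [1] * 49)

# "balanced" weight per class = n_samples / (n_classes * count_per_class)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))
```

The minority fraud class ends up with roughly three times the weight of the majority class, which is what "pay more attention to the minority class" means mechanically.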

“Logistic Regression was tuned using AUC-based cross-validation to improve ranking quality before threshold selection.”
. . .

5. Threshold Tuning: Turning Probabilities Into Decisions

After tuning, the model outputs probabilities. Operations teams do not investigate probabilities; they investigate claims. So the next step is choosing a threshold.

The threshold decides when a claim is flagged:

  • probability ≥ threshold → flag for review
  • probability < threshold → leave unflagged

The default threshold in many libraries is 0.50, but that is usually a poor choice for imbalanced fraud problems. In this tuning stage, thresholds were swept and the best one was selected using F1 (the balance of precision and recall).

Best threshold (F1-max): 0.35

That makes operational sense. Lowering the threshold catches more fraud, but also increases false alarms. The cutoff was chosen by measurement instead of wishful thinking, which is exactly how this should be done.

// Code Snippet — Threshold Sweep (Representative)
import numpy as np
from sklearn.metrics import f1_score

best_model = grid.best_estimator_  # tuned model from the grid search
proba = best_model.predict_proba(X_test)[:, 1]

thresholds = np.arange(0.05, 0.95, 0.05)
scores = []

for t in thresholds:
    preds = (proba >= t).astype(int)
    scores.append((t, f1_score(y_test, preds)))

best_threshold, best_f1 = max(scores, key=lambda x: x[1])
print(best_threshold, best_f1)

Why this matters

The model and the decision policy are different things.

  • The model learns probabilities
  • The threshold defines workload and risk appetite

That separation is very PM-friendly because it maps directly to real operations: one scoring engine, multiple decision modes.
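That separation can be written down directly. The function and probabilities below are illustrative, not pipeline code: one scoring engine, and a threshold parameter per decision mode:

```python
import numpy as np

def flag_claims(proba, threshold):
    """Decision policy: flag any claim whose fraud probability
    meets or exceeds the operating threshold."""
    return (np.asarray(proba) >= threshold).astype(int)

# Hypothetical model outputs for four claims
proba = np.array([0.10, 0.36, 0.52, 0.71])

print(flag_claims(proba, 0.50))  # default cutoff flags two claims
print(flag_claims(proba, 0.35))  # tuned cutoff flags three claims
```

Changing the threshold changes the workload without retraining anything, which is the whole point of keeping the two concerns separate.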

“Threshold tuning selected 0.35 as the best F1 operating point, improving fraud capture compared to the default cutoff.”
“The overlap between class probability distributions explains why threshold tuning is necessary and why false positives/negatives remain unavoidable.”
. . .

6. Test-Set Results at the Selected Threshold (0.35)

At the chosen operating threshold (0.35), the model produced the following confusion matrix:

“At the selected threshold, the linear model catches 33 fraud cases and misses 16, with 17 false alarms.”

Which means:

  • True Negatives (correctly kept clear): 134
  • False Positives (flagged but not fraud): 17
  • False Negatives (missed fraud): 16
  • True Positives (caught fraud): 33

Classification metrics at threshold 0.35

Fraud class (Class 1):

  • Precision: 0.66
  • Recall: 0.67
  • F1: 0.67
  • Overall accuracy: 0.83
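These figures follow directly from the confusion-matrix counts above; a quick arithmetic check:

```python
# Recomputing the reported metrics from the confusion-matrix counts
tn, fp, fn, tp = 134, 17, 16, 33

precision = tp / (tp + fp)                   # 33 of 50 flags are fraud
recall = tp / (tp + fn)                      # 33 of 49 frauds are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 167 of 200 claims correct

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```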

How to read that in plain terms

The model catches about two-thirds of fraud cases in the test set, while about one-third of flagged claims are false alarms.

That is a usable baseline for a fraud workflow, especially as a transparent reference model. It is not perfect (fraud detection never is), but it is doing real work.

. . .

7. Why Performance Plateaus in Linear Models

The project notes point to the core issue clearly: even after strong feature engineering and tuning, performance plateaus because fraud behavior depends on interactions and non-linear effects.

Simple explanation

A linear model is good at adding signals together.

Fraud patterns often behave more like combinations:

  • severe incident + short policy tenure + missing field
  • claim composition pattern + specific context
  • risk spikes that exist only in pockets of the feature space

You can manually create more and more engineered features to help a linear model approximate these pockets, but eventually that becomes a giant maintenance project.
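A minimal synthetic demonstration of that plateau: when the signal lives entirely in an interaction (here, the XOR of two binary flags), there is no single linear boundary for a logistic regression to find, while a shallow tree isolates both risk pockets. This is a contrived toy, not the project's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Two binary risk flags; "fraud" occurs only when exactly one is set,
# so neither flag carries any signal on its own
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 100)
y = X[:, 0] ^ X[:, 1]

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

print("linear accuracy:", linear.score(X, y))  # stuck near chance
print("tree accuracy:", tree.score(X, y))      # splits into both pockets
```

Hand-crafting an interaction feature (x1 × x2) would rescue the linear model here, which is exactly the "giant maintenance project" the paragraph above warns about once the real pockets multiply.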

This is exactly why the workflow uses the linear model as a transparent baseline, then moves on to tree-based models for stronger recall-focused performance.

“One global boundary vs. recursive local partitions. Trees isolate local risk pockets that linear models miss.”
. . .

8. Operational Interpretation (PM Lens)

This stage is very useful from a product perspective because it separates three things cleanly:

  1. Feature design (what signals exist)
  2. Model ranking quality (AUC)
  3. Decision policy (threshold)

That separation lets you talk about tradeoffs in operational terms:

  • If investigations are overloaded, raise the threshold
  • If missed fraud is too costly, lower the threshold
  • If leadership wants more transparency, keep the linear baseline as a benchmark

In other words, the model is the engine, and the threshold is the steering wheel. Teams usually mix those up and then wonder why nobody agrees on performance.
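The workload lever is easy to sketch. With a fixed set of scores (simulated below, not real model output), each candidate threshold implies a review-queue size:

```python
import numpy as np

rng = np.random.default_rng(7)
proba = rng.beta(1, 6, size=1000)  # hypothetical right-skewed fraud scores

# Raising the threshold shrinks the review queue; lowering it grows it
counts = {}
for t in (0.25, 0.35, 0.50):
    counts[t] = int((proba >= t).sum())
    print(f"threshold {t:.2f} -> {counts[t]} claims flagged")
```

The scores never change between rows; only the decision policy does.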

. . .

9. Summary

The linear tuning stage produced a strong and explainable baseline for fraud scoring:

  • Logistic Regression was tuned with cross-validated AUC
  • The selected configuration used regularization and class balancing
  • AUC improved to 0.8202 on the held-out test set
  • Threshold tuning selected 0.35 as the best F1 operating point
  • At that threshold, fraud precision and recall both landed at ~0.67
  • The model remains useful as a transparent baseline, while its structural limitations point toward tree-based models for stronger final performance

Key Takeaway: This stage turns a linear model into a credible operational baseline instead of a toy benchmark.

The tuning work matters because it shows what careful feature shaping and threshold selection can achieve before moving to more complex model families. It also makes later improvements easier to justify: when tree models outperform this baseline, the gain is visible, measurable, and earned.