Purpose
This report documents the linear-model tuning stage of the fraud detection pipeline.
At this point, the data had already been cleaned and reshaped for linear modeling (including step-like binary flags, tenure buckets, and missingness indicators). The work in this stage focused on two practical questions:
- How far can a linear model go on this problem after feature engineering?
- Which probability threshold gives a workable fraud-detection tradeoff?
The result is a tuned Logistic Regression baseline that is useful for transparency, comparison, and sanity checks in later stages.
1. Why a Linear Model Was Tuned at All
Linear models are valuable in fraud work because they are easier to inspect and explain than more complex models. They give you a clean reference point: if a more advanced model performs better, you can show exactly what extra complexity bought you.
There is also a practical reason. In regulated or operational settings, teams often want a baseline that behaves predictably and can be explained to non-technical stakeholders. Logistic Regression is the classic candidate for that job.
In plain language: a linear model draws one smooth rule across the data. That makes it easier to understand, but it also means it can miss “pockets” of fraud behavior that only appear in certain combinations of features. That limitation becomes important later in the project, and this report helps make that visible with numbers instead of vibes.
2. Inputs to the Linear Tuning Stage
The tuning was done on an engineered dataset prepared specifically for linear models. That preparation mattered a lot, because linear models usually need the feature space to be “flattened” into something they can use.
Examples of engineered inputs carried into this stage include:
- a binary flag for severe incidents (incident_severity_is_major)
- buckets for policy tenure (days_since_bind)
- explicit missingness indicators (such as collision type missing)
- categorical handling designed for linear models
This is the modeling equivalent of translating dialects before a meeting. Everyone is still talking about the same claim, but now the model can actually follow the conversation.
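As a rough sketch of what that preparation looks like, here is one way the three engineered inputs could be built with pandas. The raw column names and bucket edges below are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Hypothetical raw claims data; column names and values are illustrative only
df = pd.DataFrame({
    "incident_severity": ["Major Damage", "Minor Damage", "Major Damage"],
    "days_since_bind": [45, 800, 3000],
    "collision_type": ["Rear Collision", None, "Side Collision"],
})

# Binary flag for severe incidents
df["incident_severity_is_major"] = (df["incident_severity"] == "Major Damage").astype(int)

# Tenure buckets carved out of days_since_bind (bucket edges are made up here)
df["tenure_bucket"] = pd.cut(
    df["days_since_bind"],
    bins=[0, 90, 365, 1825, float("inf")],
    labels=["new", "first_year", "established", "long_term"],
)

# Explicit missingness indicator for collision type
df["collision_type_missing"] = df["collision_type"].isna().astype(int)
```

Each transformation turns a raw signal into something a linear model can weight directly.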
3. Model Chosen for Tuning: Logistic Regression
The tuned linear model in this stage is Logistic Regression.
What Logistic Regression does (simple version)
It estimates the probability of fraud by combining feature signals into a weighted score, then converts that score into a value between 0 and 1.
Think of it like a scoreboard:
- “Major damage” might add points
- “Missing collision type” might add points
- Certain tenure buckets might add or subtract points
At the end, the model says something like: “This claim looks 0.72 likely to be fraud.” The threshold decision (for example, 0.35 vs 0.50) turns that probability into an action.
That probability-first behavior is why Logistic Regression is still useful in serious systems. It gives you a ranked signal before you choose an operating policy.
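The scoreboard idea can be written out in a few lines. The weights and intercept below are made up for illustration; a real model learns them from the training data:

```python
import math

# Illustrative (made-up) weights for three engineered signals
weights = {
    "incident_severity_is_major": 1.2,
    "collision_type_missing": 0.8,
    "tenure_bucket_new": 0.5,
}
intercept = -1.5

def fraud_probability(claim):
    # Weighted score: each active signal adds (or subtracts) points
    score = intercept + sum(weights[f] * v for f, v in claim.items())
    # The logistic function squashes the score into a 0-1 probability
    return 1 / (1 + math.exp(-score))

claim = {"incident_severity_is_major": 1, "collision_type_missing": 1, "tenure_bucket_new": 0}
p = fraud_probability(claim)  # score = -1.5 + 1.2 + 0.8 = 0.5
flagged = p >= 0.35           # the threshold turns the probability into an action
```

Changing the threshold changes the action, not the score: that separation is the whole point of the tuning stage below.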
// Code Snippet — Logistic Regression Setup (Representative)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

log_reg = LogisticRegression()

# Search over regularization strength, penalty, solver, and class weighting
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["liblinear"],
    "class_weight": [None, "balanced"],
}

# 5-fold cross-validated grid search, scored by ranking quality (AUC)
grid = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
Why this step exists
- GridSearchCV tries multiple parameter combinations
- scoring="roc_auc" compares how well the model ranks fraud vs non-fraud across thresholds
- class_weight="balanced" helps the model pay more attention to the minority fraud class
4. Hyperparameter Search Results
A grid search was run for Logistic Regression using AUC as the tuning objective.
- Best CV AUC: 0.7462
- Test AUC (Held-out): 0.8202
Best parameter set
- C = 0.1
- class_weight = 'balanced'
- penalty = 'l2'
- solver = 'liblinear'
What these parameters mean in human language
- C=0.1 → stronger regularization (the model is kept on a shorter leash, which reduces overfitting)
- class_weight='balanced' → fraud cases get more weight during training because they are fewer
- l2 penalty → shrinks coefficients smoothly rather than letting them grow wildly
- liblinear solver → a stable solver commonly used for smaller/medium tabular classification problems
This combination is a pretty sensible “grown-up baseline”: not too loose, not too rigid, and explicitly adjusted for imbalance.
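For reference, the selected configuration can be pinned down as a standalone estimator. The training data here is a synthetic stand-in (made with make_classification) so the sketch runs on its own; in the pipeline the engineered X_train / y_train from the grid-search snippet would be used instead:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the engineered training data (imbalanced, like fraud)
X_train, y_train = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# The configuration selected by the grid search
best_log_reg = LogisticRegression(
    C=0.1,                    # stronger regularization
    penalty="l2",             # smooth coefficient shrinkage
    solver="liblinear",       # stable solver for small/medium tabular problems
    class_weight="balanced",  # upweight the minority fraud class
)
best_log_reg.fit(X_train, y_train)

# Ranked fraud scores, ready for threshold selection
proba = best_log_reg.predict_proba(X_train)[:, 1]
```

(Equivalently, grid.best_estimator_ holds the refitted winner when refit is left on.)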
5. Threshold Tuning: Turning Probabilities Into Decisions
After tuning, the model outputs probabilities. Operations teams do not investigate probabilities; they investigate claims. So the next step is choosing a threshold.
The threshold decides when a claim is flagged:
- probability ≥ threshold → flag for review
- probability < threshold → leave unflagged
The default threshold in many libraries is 0.50, but that is usually a poor default for imbalanced fraud problems. In your tuning stage, thresholds were swept and the best one was selected using F1 (the balance of precision and recall).
Best threshold (F1-max): 0.35
That makes operational sense. Lowering the threshold catches more fraud, but also increases false alarms. You chose the cutoff by measurement instead of wishful thinking, which is exactly how this should be done.
// Code Snippet — Threshold Sweep (Representative)
import numpy as np
from sklearn.metrics import f1_score

# Probability of the positive (fraud) class on the held-out test set
proba = best_model.predict_proba(X_test)[:, 1]

# Sweep candidate thresholds and score each one by F1
thresholds = np.arange(0.05, 0.95, 0.05)
scores = []
for t in thresholds:
    preds = (proba >= t).astype(int)
    scores.append((t, f1_score(y_test, preds)))

# Keep the threshold with the highest F1
best_threshold, best_f1 = max(scores, key=lambda x: x[1])
print(best_threshold, best_f1)
Why this matters
The model and the decision policy are different things.
- The model learns probabilities
- The threshold defines workload and risk appetite
That separation is very PM-friendly because it maps directly to real operations: one scoring engine, multiple decision modes.
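That "one scoring engine, multiple decision modes" idea is easy to demonstrate: the same probabilities produce different workloads under different thresholds. The scores below are invented for illustration:

```python
import numpy as np

# One scoring engine: a fixed set of predicted fraud probabilities (illustrative)
proba = np.array([0.10, 0.30, 0.40, 0.55, 0.72, 0.90])

# Multiple decision modes: the same scores under different thresholds
conservative = (proba >= 0.50).astype(int)  # lighter workload, more missed fraud
aggressive = (proba >= 0.35).astype(int)    # heavier workload, fewer misses

flagged_conservative = conservative.sum()   # 3 claims sent to review
flagged_aggressive = aggressive.sum()       # 4 claims sent to review
```

Nothing about the model changed between the two modes; only the decision policy did.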
6. Test-Set Results at the Selected Threshold (0.35)
At the chosen operating threshold (0.35), the model produced the following confusion matrix:

|  | Predicted: Not Fraud | Predicted: Fraud |
| --- | --- | --- |
| Actual: Not Fraud | 134 (TN) | 17 (FP) |
| Actual: Fraud | 16 (FN) | 33 (TP) |

Which means:
- True Negatives (correctly kept clear): 134
- False Positives (flagged but not fraud): 17
- False Negatives (missed fraud): 16
- True Positives (caught fraud): 33
Classification metrics at threshold 0.35
Fraud class (Class 1):
- Precision: 0.66
- Recall: 0.67
- F1: 0.67
- Overall accuracy: 0.83
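Those headline numbers fall straight out of the confusion-matrix counts, which makes them easy to sanity-check:

```python
# Confusion-matrix counts at threshold 0.35
tn, fp, fn, tp = 134, 17, 16, 33

precision = tp / (tp + fp)                          # 33 / 50  = 0.66
recall = tp / (tp + fn)                             # 33 / 49  ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.67
accuracy = (tn + tp) / (tn + fp + fn + tp)          # 167 / 200 ≈ 0.83
```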
How to read that in plain terms
The model catches about two-thirds of fraud cases in the test set, while about one-third of flagged claims are false alarms.
That is a usable baseline for a fraud workflow, especially as a transparent reference model. It is not perfect (fraud detection never is), but it is doing real work.
7. Why Performance Plateaus in Linear Models
Your project notes point to the core issue clearly: even after strong feature engineering and tuning, performance plateaus because fraud behavior depends on interactions and non-linear effects.
Simple explanation
A linear model is good at adding signals together.
Fraud patterns often behave more like combinations:
- severe incident + short policy tenure + missing field
- claim composition pattern + specific context
- risk spikes that exist only in pockets of the feature space
You can manually create more and more engineered features to help a linear model approximate these pockets, but eventually that becomes a giant maintenance project.
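A minimal sketch of what that manual approximation looks like in practice: each "pocket" needs its own hand-built interaction column before a linear model can score the combination as a single signal. The flag names below are hypothetical examples:

```python
import pandas as pd

# Illustrative engineered flags from earlier stages (hypothetical names)
df = pd.DataFrame({
    "incident_severity_is_major": [1, 1, 0, 1],
    "tenure_is_short": [1, 0, 1, 1],
    "collision_type_missing": [0, 0, 0, 1],
})

# A linear model only adds these signals; an explicit interaction column
# is needed before the *combination* becomes a signal in its own right
df["major_and_short_tenure"] = (
    df["incident_severity_is_major"] * df["tenure_is_short"]
)

# Three-way pockets need yet another column, and so on: this is the
# maintenance burden that tree-based models sidestep
df["major_short_and_missing"] = (
    df["major_and_short_tenure"] * df["collision_type_missing"]
)
```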
This is exactly why your workflow uses the linear model as a transparent baseline, then moves on to tree-based models for stronger recall-focused performance.
8. Operational Interpretation (PM Lens)
This stage is very useful from a product perspective because it separates three things cleanly:
- Feature design (what signals exist)
- Model ranking quality (AUC)
- Decision policy (threshold)
That separation lets you talk about tradeoffs in operational terms:
- If investigations are overloaded, raise the threshold
- If missed fraud is too costly, lower the threshold
- If leadership wants more transparency, keep the linear baseline as a benchmark
In other words, the model is the engine, and the threshold is the steering wheel. Teams usually mix those up and then wonder why nobody agrees on performance.
9. Summary
The linear tuning stage produced a strong and explainable baseline for fraud scoring:
- Logistic Regression was tuned with cross-validated AUC
- The selected configuration used regularization and class balancing
- AUC improved to 0.8202 on the held-out test set
- Threshold tuning selected 0.35 as the best F1 operating point
- At that threshold, fraud precision and recall both landed at ~0.67
- The model remains useful as a transparent baseline, while its structural limitations point toward tree-based models for stronger final performance
Key Takeaway: This stage turns a linear model into a credible operational baseline instead of a toy benchmark.
The tuning work matters because it shows what careful feature shaping and threshold selection can achieve before moving to more complex model families. It also makes later improvements easier to justify: when tree models outperform this baseline, the gain is visible, measurable, and earned.