Purpose
This report documents how the raw claims dataset was transformed into a clean, leakage-controlled, modeling-ready foundation.
The previous report established what the dataset represents. This report explains how it was engineered into a reliable analytical system.
The focus is:
- Structural correction
- Leakage elimination
- Missingness handling
- Temporal normalization
- Dual feature-engineering strategy (Trees vs Linear models)
No modeling yet. Just disciplined preparation.
1. From Raw Records to Clean Base Dataset
The initial dataset contained:
- 1,000 rows
- 40 columns
- Mixed types: numeric, categorical, temporal, identifiers
During preprocessing:
- 1 invalid temporal record was removed
- Leakage-prone identifiers were dropped
- Missing values were harmonized
- Date fields were standardized
- Derived variables were created
Final base dataset: 999 structurally valid records.
This stage ensured the dataset could support reliable modeling without hidden shortcuts.
2. Leakage Control: Removing Hidden Answer Keys
Some columns contained information that would allow a model to “cheat.”
Removed:
- policy_number (unique identifier)
- incident_location (unique street-level detail)
- _c39 (empty column)
- insured_zip (near-identifier, almost 1:1 with city/state)
- auto_year (replaced with vehicle_age)
Why this matters:
If a feature uniquely identifies a record, the model can memorize fraud labels instead of learning patterns. That produces artificially high accuracy but fails in production.
In short: we removed lookup keys disguised as data.
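A minimal sketch of this step, using a synthetic slice of the data (the column names come from the report; the values, and the 2015 stand-in for the incident year used to derive vehicle_age, are illustrative assumptions):

```python
import pandas as pd

# Hypothetical raw rows; values are illustrative only.
df = pd.DataFrame({
    "policy_number": [521585, 342868, 687698],
    "incident_location": ["9935 4th Drive", "6608 MLK Hwy", "7121 Francis Ln"],
    "_c39": [None, None, None],
    "insured_zip": [466132, 468176, 430632],
    "auto_year": [2004, 2007, 2014],
    "total_claim_amount": [71610, 5070, 34650],
})

# Columns that act as lookup keys (or are empty): dropping them prevents
# the model from memorizing individual records instead of learning patterns.
leakage_cols = ["policy_number", "incident_location", "_c39", "insured_zip"]
df = df.drop(columns=leakage_cols)

# Replace the absolute model year with a relative quantity, vehicle_age
# (2015 is assumed here as the incident year).
df["vehicle_age"] = 2015 - df["auto_year"]
df = df.drop(columns=["auto_year"])
```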
3. Missingness as Signal (Not Noise)
The dataset used both "?" and NaN.
These were harmonized into a unified "Missing" category for:
- collision_type
- property_damage
- police_report_available
- authorities_contacted
Instead of imputing values (e.g., filling with “No”), we created explicit missingness indicators.
Why?
Because early inspection showed that missing documentation disproportionately appeared in fraudulent cases.
In fraud detection, absence of paperwork is not random. It may reflect ambiguity, evasion, or incomplete reporting.
So we encoded missingness explicitly rather than hiding it.
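The harmonization can be sketched as follows, on two of the affected columns with invented example values:

```python
import numpy as np
import pandas as pd

# Illustrative rows; "?" and NaN both mean "not recorded" in the raw data.
df = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", np.nan],
    "police_report_available": ["YES", "?", "NO"],
})

doc_cols = ["collision_type", "police_report_available"]
for col in doc_cols:
    # Harmonize both missing markers into one explicit category...
    df[col] = df[col].replace("?", np.nan).fillna("Missing")
    # ...and keep a binary indicator so models can treat absence as signal.
    df[col + "_missing"] = (df[col] == "Missing").astype(int)
```

Keeping "Missing" as its own category (rather than imputing "No") lets both model families learn from the pattern of absent documentation directly.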
4. Temporal Standardization
Two date fields:
- policy_bind_date
- incident_date
Both were converted to datetime format.
We then derived:
days_since_bind = incident_date − policy_bind_date
One record where the policy began after the incident was removed. Raw dates were dropped after transformation.
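The derivation and the validity filter above can be sketched as (dates are invented; the second row plays the role of the one invalid record):

```python
import pandas as pd

df = pd.DataFrame({
    "policy_bind_date": ["2014-10-17", "2015-06-27", "2014-11-05"],
    "incident_date":    ["2015-01-25", "2015-01-21", "2015-02-22"],
})

for col in ["policy_bind_date", "incident_date"]:
    df[col] = pd.to_datetime(df[col])

df["days_since_bind"] = (df["incident_date"] - df["policy_bind_date"]).dt.days

# Remove records where the policy began after the incident (negative age),
# then drop the raw dates now that the derived feature exists.
df = df[df["days_since_bind"] >= 0]
df = df.drop(columns=["policy_bind_date", "incident_date"])
```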
Why derive instead of keeping raw dates?
Because models do not understand calendar meaning directly; they do understand measurable quantities such as:
- policy age
- month
- weekday
- weekend
- holiday proximity
We transformed timestamps into interpretable behavioral signals.
5. Numeric Corrections
One anomaly was detected:
umbrella_limit contained negative values.
These were corrected by taking the absolute value.
This was not modeling enhancement — just restoring logical consistency.
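The correction is a one-line transformation (example values invented):

```python
import pandas as pd

df = pd.DataFrame({"umbrella_limit": [0, -1_000_000, 6_000_000]})

# Negative coverage limits are logically impossible; treat the sign as a
# data-entry artifact and restore the magnitude.
df["umbrella_limit"] = df["umbrella_limit"].abs()
```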
6. Feature Engineering Strategy: One Dataset Is Not Enough
A key architectural decision in Step 1 was building two separate modeling datasets:
- Trees dataset
- Linear dataset
Why two?
Because tree models and linear models interpret structure differently.
7. Shared Feature Engineering
Applied to both pipelines:
7.1 Temporal Features
- days_since_bind
- incident_month
- vehicle_age
- hour_sin, hour_cos
Cyclical encoding explanation:
The hour of day cycles every 24 hours: 23:00 and 00:00 are neighbors in real time but numerically far apart (23 vs. 0).
Sine and cosine encoding places them close in geometric space. Think of time as a circle, not a line.
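A sketch of the encoding, assuming the source hour column is named incident_hour_of_the_day (the exact column name is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"incident_hour_of_the_day": [0, 6, 12, 23]})
hours = df["incident_hour_of_the_day"]

# Map each hour onto a point on the unit circle, so 23:00 and 00:00
# end up geometrically adjacent instead of 23 units apart.
df["hour_sin"] = np.sin(2 * np.pi * hours / 24)
df["hour_cos"] = np.cos(2 * np.pi * hours / 24)
```

On the circle, the distance between hour 23 and hour 0 is small, while hour 12 sits diametrically opposite hour 0, which matches intuition about time of day.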
7.2 Claim Composition Features
To reduce dominance of large monetary values:
- injury_share = injury_claim / total_claim_amount
- property_share = property_claim / total_claim_amount
Instead of only learning “big claims are risky,” the model can learn patterns like:
- “High injury share with low property damage”
- or “Unusual claim composition”
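The share features above can be computed directly (claim amounts here are invented; a real pipeline would also need to guard against a zero total):

```python
import pandas as pd

df = pd.DataFrame({
    "injury_claim":       [6510, 780, 3850],
    "property_claim":     [13020, 780, 7700],
    "total_claim_amount": [71610, 5070, 34650],
})

# Shares express claim composition on a 0-1 scale, so large absolute
# amounts no longer dominate the signal.
df["injury_share"] = df["injury_claim"] / df["total_claim_amount"]
df["property_share"] = df["property_claim"] / df["total_claim_amount"]
```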
8. Trees-Optimized Dataset
Tree models can naturally handle:
- Non-linear interactions
- Raw categorical variables
- High cardinality features
Therefore, we retained high-cardinality categoricals such as:
- auto_make
- auto_model
- insured_hobbies
Minimal bucketing was applied. Trees decide splits like: “If auto_model = X and collision_type = Y → higher risk.” No coefficient instability.
9. Linear-Optimized Dataset
Linear models assume straight-line relationships. But many real-world variables behave in curves.
Example: Fraud risk might increase sharply below 3 months tenure, then flatten.
A straight line cannot represent that shape well. So we used bucketing.
9.1 Risk Bucketing
Continuous variables were segmented:
- customer_tenure_bucket
- claim_amount_risk_band
- umbrella_limit_bin
- policy_annual_premium_bin
- vehicle_age_bucket
What bucketing does:
Instead of asking the model to draw one straight line across all values, we group similar ranges together.
Imagine approximating a curve using stairs instead of a single ramp. It’s still simple — but much more flexible.
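As a sketch, tenure bucketing might look like the following; the source column name and the cut points are hypothetical, since the report does not specify the actual boundaries:

```python
import pandas as pd

df = pd.DataFrame({"months_as_customer": [1, 14, 70, 250, 400]})

# Hypothetical boundaries: under 1 year, 1-5 years, 5-20 years, 20+ years.
df["customer_tenure_bucket"] = pd.cut(
    df["months_as_customer"],
    bins=[0, 12, 60, 240, float("inf")],
    labels=["<1yr", "1-5yr", "5-20yr", "20yr+"],
    include_lowest=True,
)
```

Each bucket becomes its own indicator after one-hot encoding, letting a linear model fit a different level per range, which is the "stairs instead of a ramp" idea.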
10. Final Preprocessing Outputs
After all transformations:
Base cleaned dataset:
- 999 valid records
- Leakage removed
- Temporal features derived
- Missingness encoded
Two modeling-ready datasets:
- Trees dataset
  - Categorical richness retained
  - Share features included
  - Interaction-friendly
- Linear dataset
  - Bucketed non-linear variables
  - Structured temporal semantics
  - Controlled cardinality
All transformations were deterministic and documented in the preprocessing pipeline.
Key Takeaway
This stage accomplished:
- Structural data correction
- Leakage prevention
- Explicit encoding of behavioral absence
- Temporal normalization
- Multicollinearity reduction via share features
- Model-family-aligned feature architecture
No modeling was performed here. But without this stage, any modeling results would be unreliable.