Purpose

This report documents how the raw claims dataset was transformed into a clean, leakage-controlled, modeling-ready foundation.

The previous report established what the dataset represents. This report explains how it was engineered into a reliable analytical system.

The focus is:

  • Structural correction
  • Leakage elimination
  • Missingness handling
  • Temporal normalization
  • Dual feature-engineering strategy (Trees vs Linear models)

No modeling yet. Just disciplined preparation.

1. From Raw Records to Clean Base Dataset

The initial dataset contained:

  • 1,000 rows
  • 40 columns
  • Mixed types: numeric, categorical, temporal, identifiers

During preprocessing:

  • 1 invalid temporal record was removed
  • Leakage-prone identifiers were dropped
  • Missing values were harmonized
  • Date fields were standardized
  • Derived variables were created

Final base dataset: 999 structurally valid records.

This stage ensured the dataset could support reliable modeling without hidden shortcuts.

2. Leakage Control: Removing Hidden Answer Keys

Some columns contained information that would allow a model to “cheat.”

Removed:

  • policy_number (unique identifier)
  • incident_location (unique street-level detail)
  • _c39 (empty column)
  • insured_zip (near-identifier, almost 1:1 with city/state)
  • auto_year (replaced with vehicle_age)
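The drop step above can be sketched in pandas. The column names come from this report; the row values and the reference year are illustrative assumptions, not values from the real dataset:

```python
import pandas as pd

# Toy frame holding the leakage-prone columns named above
# (values are illustrative, not taken from the real dataset).
df = pd.DataFrame({
    "policy_number": [521585, 342868],
    "incident_location": ["9935 4th Drive", "6608 MLK Hwy"],
    "_c39": [None, None],
    "insured_zip": [466132, 468176],
    "auto_year": [2004, 2007],
    "total_claim_amount": [71610, 5070],
})

REFERENCE_YEAR = 2015  # assumption: the dataset's collection year

# Replace the raw model year with a model-friendly age, then drop the lookup keys.
df["vehicle_age"] = REFERENCE_YEAR - df["auto_year"]
df = df.drop(columns=["policy_number", "incident_location",
                      "_c39", "insured_zip", "auto_year"])
```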

Why this matters:

If a feature uniquely identifies a record, the model can memorize fraud labels instead of learning patterns. That produces artificially high accuracy but fails in production.

In short: we removed lookup keys disguised as data.

Columns with near-unique values introduce memorization risk and were removed.

3. Missingness as Signal (Not Noise)

The dataset marked missing values inconsistently, using both the literal string "?" and NaN.

These were harmonized into a unified "Missing" category for:

  • collision_type
  • property_damage
  • police_report_available
  • authorities_contacted

Instead of imputing values (e.g., filling with “No”), we created explicit missingness indicators.

Why?

Because early inspection showed that missing documentation disproportionately appeared in fraudulent cases.

In fraud detection, absence of paperwork is not random. It may reflect ambiguity, evasion, or incomplete reporting.

So we encoded missingness explicitly rather than hiding it.
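A minimal sketch of that harmonization, using two of the fields listed above (rows are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative rows; the real dataset mixes "?" and NaN in these fields.
df = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", np.nan],
    "police_report_available": ["YES", "NO", "?"],
})

for col in ["collision_type", "police_report_available"]:
    # Harmonize both sentinel forms into one explicit "Missing" category...
    df[col] = df[col].replace("?", np.nan).fillna("Missing")
    # ...and keep a binary flag so absence itself is available as a feature.
    df[col + "_missing"] = (df[col] == "Missing").astype(int)
```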

Missing documentation correlates with elevated fraud probability.

4. Temporal Standardization

Two date fields:

  • policy_bind_date
  • incident_date

Both were converted to datetime format.

We then derived:

days_since_bind = incident_date − policy_bind_date

One record where the policy began after the incident was removed. Raw dates were dropped after transformation.
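The derivation and the validity filter together, sketched on two illustrative records (the dates are invented; the second row reproduces the inverted-date case):

```python
import pandas as pd

# Two illustrative records; the second is structurally impossible
# (the policy was bound after the incident).
df = pd.DataFrame({
    "policy_bind_date": ["2010-06-01", "2015-03-10"],
    "incident_date": ["2015-02-20", "2015-01-05"],
})
for col in ["policy_bind_date", "incident_date"]:
    df[col] = pd.to_datetime(df[col])

df["days_since_bind"] = (df["incident_date"] - df["policy_bind_date"]).dt.days

# Remove the inverted record, then drop the raw dates the feature replaces.
df = df[df["days_since_bind"] >= 0].drop(
    columns=["policy_bind_date", "incident_date"]
)
```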

Why derive instead of keeping raw dates?

Models do not interpret calendar dates directly, but they do understand measurable quantities like:

  • policy age
  • month
  • weekday
  • weekend
  • holiday proximity

We transformed timestamps into interpretable behavioral signals.
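Month, weekday, and weekend flags fall out of the datetime accessor directly; a short sketch (dates are illustrative, and holiday proximity is omitted since it needs an external holiday calendar):

```python
import pandas as pd

df = pd.DataFrame({"incident_date": pd.to_datetime(["2015-01-25", "2015-02-18"])})

df["incident_month"] = df["incident_date"].dt.month
df["incident_weekday"] = df["incident_date"].dt.dayofweek  # Monday = 0
df["is_weekend"] = (df["incident_weekday"] >= 5).astype(int)
```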

Short policy tenure appears more frequently among fraudulent claims.

5. Numeric Corrections

One anomaly was detected:

  • umbrella_limit contained negative values.

These were corrected with an absolute-value transformation.

This was not modeling enhancement — just restoring logical consistency.
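The fix is a one-liner (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({"umbrella_limit": [6000000, -1000000, 0]})

# A negative liability limit is a sign-entry artifact; restore the magnitude.
df["umbrella_limit"] = df["umbrella_limit"].abs()
```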

6. Feature Engineering Strategy: One Dataset Is Not Enough

A key architectural decision at this stage was building two separate modeling datasets:

  • Trees dataset
  • Linear dataset

Why two?

Because tree models and linear models interpret structure differently.

7. Shared Feature Engineering

Applied to both pipelines:

7.1 Temporal Features

  • days_since_bind
  • incident_month
  • vehicle_age
  • hour_sin, hour_cos

Cyclical encoding explanation:

The hour of day is cyclical: 23:00 and 00:00 are neighbors in real life, yet numerically far apart (23 vs 0).

Sine and cosine encoding places them close in geometric space. Think of time as a circle, not a line.
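The circle analogy in code. The raw hour column name (`incident_hour`) is an assumption; only `hour_sin` and `hour_cos` are named in this report:

```python
import numpy as np
import pandas as pd

# Assumed raw column name for the incident hour (0-23).
df = pd.DataFrame({"incident_hour": [0, 6, 12, 23]})

angle = 2 * np.pi * df["incident_hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

# On the circle, 23:00 and 00:00 sit close together,
# even though |23 - 0| = 23 on the number line.
p23 = df.loc[df["incident_hour"] == 23, ["hour_sin", "hour_cos"]].to_numpy()[0]
p00 = df.loc[df["incident_hour"] == 0, ["hour_sin", "hour_cos"]].to_numpy()[0]
circle_distance = float(np.linalg.norm(p23 - p00))
```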

Cyclic encoding preserves temporal continuity.

7.2 Claim Composition Features

To reduce dominance of large monetary values:

  • injury_share = injury_claim / total_claim_amount
  • property_share = property_claim / total_claim_amount

Instead of only learning “big claims are risky,” the model can learn patterns like:

  • “High injury share with low property damage”
  • or “Unusual claim composition”
(Figure: injury_share distribution.)
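The share computation, sketched on illustrative claim amounts with a guard against a zero total (the guard is an assumption; the report does not state how zero totals were handled):

```python
import pandas as pd

# Illustrative claim amounts; the share features divide each component
# by the claim total so that composition, not magnitude, carries the signal.
df = pd.DataFrame({
    "injury_claim": [6510, 780],
    "property_claim": [13020, 780],
    "total_claim_amount": [71610, 5070],
})

# Mask a zero total to NaN to avoid division-by-zero artifacts.
total = df["total_claim_amount"].mask(df["total_claim_amount"] == 0)
df["injury_share"] = df["injury_claim"] / total
df["property_share"] = df["property_claim"] / total
```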

8. Trees-Optimized Dataset

Tree models can naturally handle:

  • Non-linear interactions
  • Raw categorical variables
  • High-cardinality features

Therefore, retained:

  • auto_make
  • auto_model
  • insured_hobbies

Minimal bucketing was applied. Trees decide splits like: “If auto_model = X and collision_type = Y → higher risk.” No coefficient instability.

9. Linear-Optimized Dataset

Linear models assume straight-line relationships, but many real-world relationships are curved.

Example: Fraud risk might increase sharply below 3 months tenure, then flatten.

A straight line cannot represent that shape well. So we used bucketing.

9.1 Risk Bucketing

Continuous variables were segmented:

  • customer_tenure_bucket
  • claim_amount_risk_band
  • umbrella_limit_bin
  • policy_annual_premium_bin
  • vehicle_age_bucket

What bucketing does:

Instead of asking the model to draw one straight line across all values, we group similar ranges together.

Imagine approximating a curve using stairs instead of a single ramp. It’s still simple — but much more flexible.
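The stair-step idea above might look like this with `pd.cut`. The tenure column name, bin edges, and labels are illustrative assumptions, not the pipeline's actual cut points:

```python
import pandas as pd

# Assumed tenure column; values are illustrative.
df = pd.DataFrame({"months_as_customer": [1, 8, 30, 250]})

# Each bin is one "stair": the linear model gets a separate level per range
# instead of a single slope across all tenures.
df["customer_tenure_bucket"] = pd.cut(
    df["months_as_customer"],
    bins=[-1, 3, 12, 60, 600],
    labels=["0-3m", "4-12m", "1-5y", "5y+"],
)
```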

Bucketing approximates non-linear fraud patterns for linear models.

10. Final Preprocessing Outputs

After all transformations:

Base cleaned dataset:

  • 999 valid records
  • Leakage removed
  • Temporal features derived
  • Missingness encoded

Two modeling-ready datasets:

  • Trees dataset
    • Categorical richness retained
    • Share features included
    • Interaction-friendly
  • Linear dataset
    • Bucketed non-linear variables
    • Structured temporal semantics
    • Controlled cardinality

All transformations were deterministic and documented in the preprocessing pipeline.

Key Takeaway

This stage accomplished:

  • Structural data correction
  • Leakage prevention
  • Explicit encoding of behavioral absence
  • Temporal normalization
  • Multicollinearity reduction via share features
  • Model-family-aligned feature architecture

No modeling was performed here. But without this stage, any modeling results would be unreliable.

“Data integrity is not glamorous. But it is the difference between a demo and a deployable system.”