Purpose

This report documents how the raw claims dataset was transformed into a clean, leakage-controlled, modeling-ready foundation.

The previous report established what the dataset represents. This report explains how it was engineered into a reliable analytical system.

The focus is:

  • Structural correction
  • Leakage elimination
  • Missingness handling
  • Temporal normalization
  • Dual feature-engineering strategy (Trees vs Linear models)

No modeling yet. Just disciplined preparation.

1. From Raw Records to Clean Base Dataset

The initial dataset contained:

  • 1,000 rows
  • 40 columns
  • Mixed types: numeric, categorical, temporal, identifiers

During preprocessing:

  • 1 invalid temporal record was removed
  • Leakage-prone identifiers were dropped
  • Missing values were harmonized
  • Date fields were standardized
  • Derived variables were created

Final base dataset: 999 structurally valid records.

This stage ensured the dataset could support reliable modeling without hidden shortcuts.

2. Leakage Control: Removing Hidden Answer Keys

Some columns contained information that would allow a model to “cheat.”

Removed:

  • policy_number (unique identifier)
  • incident_location (unique street-level detail)
  • _c39 (empty column)
  • insured_zip (near-identifier, almost 1:1 with city/state)
  • auto_year (replaced with vehicle_age)
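The drop step above can be sketched in pandas. The column names come from this report; the row values and the reference year are illustrative assumptions, not values from the real dataset:

```python
import pandas as pd

# Toy frame holding the leakage-prone columns named above
# (values are illustrative, not taken from the real dataset).
df = pd.DataFrame({
    "policy_number": [521585, 342868],
    "incident_location": ["9935 4th Drive", "6608 MLK Hwy"],
    "_c39": [None, None],
    "insured_zip": [466132, 468176],
    "auto_year": [2004, 2007],
    "total_claim_amount": [71610, 5070],
})

REFERENCE_YEAR = 2015  # assumption: the dataset's collection year

# Replace the raw model year with a model-friendly age, then drop the lookup keys.
df["vehicle_age"] = REFERENCE_YEAR - df["auto_year"]
df = df.drop(columns=["policy_number", "incident_location",
                      "_c39", "insured_zip", "auto_year"])
```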

Why this matters:

If a feature uniquely identifies a record, the model can memorize fraud labels instead of learning patterns. That produces artificially high accuracy but fails in production.

In short: we removed lookup keys disguised as data.

Columns with near-unique values introduce memorization risk and were removed.

3. Missingness as Signal (Not Noise)

The dataset marked missing values inconsistently, using both the literal string "?" and NaN.

These were harmonized into a unified "Missing" category for:

  • collision_type
  • property_damage
  • police_report_available
  • authorities_contacted

Instead of imputing values (e.g., filling with “No”), we created explicit missingness indicators.

Why?

Because early inspection showed that missing documentation disproportionately appeared in fraudulent cases.

In fraud detection, absence of paperwork is not random. It may reflect ambiguity, evasion, or incomplete reporting.

So we encoded missingness explicitly rather than hiding it.
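A minimal sketch of that harmonization, using two of the fields listed above (rows are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative rows; the real dataset mixes "?" and NaN in these fields.
df = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", np.nan],
    "police_report_available": ["YES", "NO", "?"],
})

for col in ["collision_type", "police_report_available"]:
    # Harmonize both sentinel forms into one explicit "Missing" category...
    df[col] = df[col].replace("?", np.nan).fillna("Missing")
    # ...and keep a binary flag so absence itself is available as a feature.
    df[col + "_missing"] = (df[col] == "Missing").astype(int)
```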

Missing documentation correlates with elevated fraud probability.

4. Temporal Standardization

Two date fields:

  • policy_bind_date
  • incident_date

Both were converted to datetime format.

We then derived:

days_since_bind = incident_date − policy_bind_date

One record where the policy began after the incident was removed. Raw dates were dropped after transformation.
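The derivation and the validity filter together, sketched on two illustrative records (the dates are invented; the second row reproduces the inverted-date case):

```python
import pandas as pd

# Two illustrative records; the second is structurally impossible
# (the policy was bound after the incident).
df = pd.DataFrame({
    "policy_bind_date": ["2010-06-01", "2015-03-10"],
    "incident_date": ["2015-02-20", "2015-01-05"],
})
for col in ["policy_bind_date", "incident_date"]:
    df[col] = pd.to_datetime(df[col])

df["days_since_bind"] = (df["incident_date"] - df["policy_bind_date"]).dt.days

# Remove the inverted record, then drop the raw dates the feature replaces.
df = df[df["days_since_bind"] >= 0].drop(
    columns=["policy_bind_date", "incident_date"]
)
```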

Why derive instead of keeping raw dates?

Models do not interpret calendar dates directly, but they do understand measurable quantities like:

  • policy age
  • month
  • weekday
  • weekend
  • holiday proximity

We transformed timestamps into interpretable behavioral signals.
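Month, weekday, and weekend flags fall out of the datetime accessor directly; a short sketch (dates are illustrative, and holiday proximity is omitted since it needs an external holiday calendar):

```python
import pandas as pd

df = pd.DataFrame({"incident_date": pd.to_datetime(["2015-01-25", "2015-02-18"])})

df["incident_month"] = df["incident_date"].dt.month
df["incident_weekday"] = df["incident_date"].dt.dayofweek  # Monday = 0
df["is_weekend"] = (df["incident_weekday"] >= 5).astype(int)
```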

Short policy tenure appears more frequently among fraudulent claims.

5. Numeric Corrections

One anomaly was detected:

  • umbrella_limit contained negative values.

These were corrected with an absolute-value transformation.

This was not modeling enhancement — just restoring logical consistency.
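The fix is a one-liner (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({"umbrella_limit": [6000000, -1000000, 0]})

# A negative liability limit is a sign-entry artifact; restore the magnitude.
df["umbrella_limit"] = df["umbrella_limit"].abs()
```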

6. Feature Engineering Strategy: One Dataset Is Not Enough

A key architectural decision at this stage was building two separate modeling datasets:

  • Trees dataset
  • Linear dataset

Why two?

Because tree models and linear models interpret structure differently.

7. Shared Feature Engineering

Applied to both pipelines:

7.1 Temporal Features

  • days_since_bind
  • incident_month
  • vehicle_age
  • hour_sin, hour_cos

Cyclical encoding explanation:

The hour of day is cyclical: 23:00 and 00:00 are neighbors in real life, yet numerically far apart (23 vs 0).

Sine and cosine encoding places them close in geometric space. Think of time as a circle, not a line.
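The circle analogy in code. The raw hour column name (`incident_hour`) is an assumption; only `hour_sin` and `hour_cos` are named in this report:

```python
import numpy as np
import pandas as pd

# Assumed raw column name for the incident hour (0-23).
df = pd.DataFrame({"incident_hour": [0, 6, 12, 23]})

angle = 2 * np.pi * df["incident_hour"] / 24
df["hour_sin"] = np.sin(angle)
df["hour_cos"] = np.cos(angle)

# On the circle, 23:00 and 00:00 sit close together,
# even though |23 - 0| = 23 on the number line.
p23 = df.loc[df["incident_hour"] == 23, ["hour_sin", "hour_cos"]].to_numpy()[0]
p00 = df.loc[df["incident_hour"] == 0, ["hour_sin", "hour_cos"]].to_numpy()[0]
circle_distance = float(np.linalg.norm(p23 - p00))
```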

Cyclic encoding preserves temporal continuity.

7.2 Claim Composition Features

To reduce dominance of large monetary values:

  • injury_share = injury_claim / total_claim_amount
  • property_share = property_claim / total_claim_amount

Instead of only learning “big claims are risky,” the model can learn patterns like:

  • “High injury share with low property damage”
  • or “Unusual claim composition”
(Figure: injury_share distribution.)
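The share computation, sketched on illustrative claim amounts with a guard against a zero total (the guard is an assumption; the report does not state how zero totals were handled):

```python
import pandas as pd

# Illustrative claim amounts; the share features divide each component
# by the claim total so that composition, not magnitude, carries the signal.
df = pd.DataFrame({
    "injury_claim": [6510, 780],
    "property_claim": [13020, 780],
    "total_claim_amount": [71610, 5070],
})

# Mask a zero total to NaN to avoid division-by-zero artifacts.
total = df["total_claim_amount"].mask(df["total_claim_amount"] == 0)
df["injury_share"] = df["injury_claim"] / total
df["property_share"] = df["property_claim"] / total
```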

8. Trees-Optimized Dataset

Tree models can naturally handle:

  • Non-linear interactions
  • Raw categorical variables
  • High-cardinality features

Therefore, retained:

  • auto_make
  • auto_model
  • insured_hobbies

Minimal bucketing was applied. Trees decide splits like: “If auto_model = X and collision_type = Y → higher risk.” No coefficient instability.

9. Linear-Optimized Dataset

Linear models assume straight-line relationships, but many real-world relationships are curved.

Example: Fraud risk might increase sharply below 3 months tenure, then flatten.

A straight line cannot represent that shape well. So we used bucketing.

9.1 Risk Bucketing

Continuous variables were segmented:

  • customer_tenure_bucket
  • claim_amount_risk_band
  • umbrella_limit_bin
  • policy_annual_premium_bin
  • vehicle_age_bucket

What bucketing does:

Instead of asking the model to draw one straight line across all values, we group similar ranges together.

Imagine approximating a curve using stairs instead of a single ramp. It’s still simple — but much more flexible.
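The stair-step idea above might look like this with `pd.cut`. The tenure column name, bin edges, and labels are illustrative assumptions, not the pipeline's actual cut points:

```python
import pandas as pd

# Assumed tenure column; values are illustrative.
df = pd.DataFrame({"months_as_customer": [1, 8, 30, 250]})

# Each bin is one "stair": the linear model gets a separate level per range
# instead of a single slope across all tenures.
df["customer_tenure_bucket"] = pd.cut(
    df["months_as_customer"],
    bins=[-1, 3, 12, 60, 600],
    labels=["0-3m", "4-12m", "1-5y", "5y+"],
)
```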

Bucketing approximates non-linear fraud patterns for linear models.

10. Final Preprocessing Outputs

After all transformations:

Base cleaned dataset:

  • 999 valid records
  • Leakage removed
  • Temporal features derived
  • Missingness encoded

Two modeling-ready datasets:

  • Trees dataset
    • Categorical richness retained
    • Share features included
    • Interaction-friendly
  • Linear dataset
    • Bucketed non-linear variables
    • Structured temporal semantics
    • Controlled cardinality

All transformations were deterministic and documented in the preprocessing pipeline.

Key Takeaway

This stage accomplished:

  • Structural data correction
  • Leakage prevention
  • Explicit encoding of behavioral absence
  • Temporal normalization
  • Multicollinearity reduction via share features
  • Model-family-aligned feature architecture

No modeling was performed here. But without this stage, any modeling results would be unreliable.

“Data integrity is not glamorous. But it is the difference between a demo and a deployable system.”