Executive Summary

The dataset was cleaned, standardized, and enriched with domain‑aligned features (temporal structure, claim composition, vehicle age, and time‑of‑day cycles). We removed identifiers and artifacts, encoded missingness, and validated the integrity of critical targets and features. The result is a compact, information‑dense table designed for downstream fraud modeling.

Figure: target class balance after preprocessing (class imbalance matters for fraud).

1. Data Loading & Initial Assessment

1.1 Dataset Characteristics

  • Original Size: 1,000 rows × 40 columns
  • Memory Usage: ~1.28 MB (in‑memory friendly)
  • Data Types: mixed numerical (≈17) and categorical (≈21)
  • Target Variable: fraud_reported (Y/N → binary indicator)

1.2 Initial Data Quality Issues

  • Empty CSV artifact column (_c39) with 100% missing values
  • Placeholder tokens ("?") in categorical fields
  • Mixed casing/whitespace in text columns
  • Negative values in umbrella_limit (domain violation)
  • High‑cardinality quasi‑identifiers (e.g., near‑unique ZIPs/locations)
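The issues above can be surfaced with a quick audit pass before any cleaning. A minimal sketch, assuming the column names from this report; the `audit_quality` helper itself is illustrative, not part of the pipeline:

```python
import pandas as pd

def audit_quality(df: pd.DataFrame) -> dict:
    """Summarize common raw-data issues prior to cleaning."""
    return {
        # columns that are 100% missing (CSV export artifacts like _c39)
        'empty_columns': [c for c in df.columns if df[c].isna().all()],
        # "?" placeholder tokens in text columns
        'placeholder_counts': {c: int((df[c] == '?').sum())
                               for c in df.select_dtypes('object')
                               if (df[c] == '?').any()},
        # domain violation: negative umbrella coverage
        'negative_umbrella': int((df['umbrella_limit'] < 0).sum()),
    }

raw = pd.DataFrame({
    '_c39': [None, None],
    'collision_type': ['?', 'Rear Collision'],
    'umbrella_limit': [-1000000, 0],
})
report = audit_quality(raw)
```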

2. Data Cleaning & Quality Assurance

2.1 Structural Cleanup

  • Removed empty column: dropped _c39 (CSV export artifact).
  • Normalized missing values: converted "?" → NaN in collision_type, property_damage, and police_report_available.
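Both structural steps can be sketched in two lines of pandas (toy frame for illustration; the real data has three placeholder columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '_c39': [np.nan, np.nan],                 # empty export artifact
    'collision_type': ['?', 'Side Collision'],
    'property_damage': ['YES', '?'],
})

df = df.drop(columns=['_c39'])                # remove CSV export artifact
placeholder_cols = ['collision_type', 'property_damage']
df[placeholder_cols] = df[placeholder_cols].replace('?', np.nan)
```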

2.2 Text Standardization

  • Trimmed leading/trailing whitespace across all text columns.
  • Normalized case for categorical values (consistent tokens).
  • Audited other missing representations (NA/NULL/None).
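A minimal sketch of the standardization pass; title-casing is one reasonable normalization choice (the report does not specify which casing convention was used):

```python
import pandas as pd

df = pd.DataFrame({'auto_make': ['  Honda ', 'honda', 'TOYOTA']})

# trim whitespace and normalize casing across all text columns
for col in df.select_dtypes('object'):
    df[col] = df[col].str.strip().str.title()

# audit alternative missing-value representations
suspect_tokens = {'NA', 'NULL', 'None', ''}
hits = {c: int(df[c].isin(suspect_tokens).sum())
        for c in df.select_dtypes('object')}
```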

2.3 Domain Logic Enforcement

  • Umbrella coverage: fixed negatives via absolute value on umbrella_limit.
  • Target encoding: mapped fraud_reported Y/N → {1,0} with compact integer type.

3. Feature Engineering

3.1 Temporal Features

  • Date parsing: converted policy_bind_date and incident_date to datetime.
  • Policy duration: days_since_bind (incident date − bind date).
  • Calendar signals: incident_month, incident_dow (0=Mon), incident_weekend (binary).
  • Removed constant: dropped incident_year (all 2015).
# Core transforms (excerpt)
    import numpy as np
    import pandas as pd

    df['umbrella_limit'] = df['umbrella_limit'].abs()  # domain fix: negatives invalid
    df['target'] = df['fraud_reported'].map({'Y': 1, 'N': 0}).astype('int8')

    # Temporal
    df['policy_bind_date'] = pd.to_datetime(df['policy_bind_date'])
    df['incident_date'] = pd.to_datetime(df['incident_date'])
    df['days_since_bind'] = (df['incident_date'] - df['policy_bind_date']).dt.days
    df['incident_month'] = df['incident_date'].dt.month
    df['incident_dow'] = df['incident_date'].dt.weekday  # 0 = Monday
    df['incident_weekend'] = (df['incident_dow'] >= 5).astype('int8')

    # Claim composition (guard against division by zero)
    total = df['total_claim_amount'].replace(0, np.nan)
    df['injury_share']   = df['injury_claim']   / total
    df['property_share'] = df['property_claim'] / total

    # Vehicle age & cyclic hour-of-day encoding
    df['vehicle_age'] = 2015 - df['auto_year']  # all incidents occurred in 2015
    h = df['incident_hour_of_the_day']
    df['hour_sin'] = np.sin(2 * np.pi * h / 24)
    df['hour_cos'] = np.cos(2 * np.pi * h / 24)

3.2 Claim Amount Composition

Explored multiple representations (component totals vs. shares). Selected total + component shares to retain scale while capturing composition:

  • injury_share = injury_claim / total_claim_amount
  • property_share = property_claim / total_claim_amount
  • Protected against division by zero (NaN handling).

3.3 Vehicle Information

  • Vehicle age: vehicle_age = 2015 − auto_year.
  • Dropped raw auto_year after derivation to avoid redundancy.

3.4 Time‑of‑Day Cyclic Encoding

  • hour_sin = sin(2π × hour/24), hour_cos = cos(2π × hour/24).
  • Added 4 interpretation‑friendly buckets: Night (0–5), Morning (6–11), Afternoon (12–17), Evening (18–23).
Figure: incident counts by time‑of‑day bucket (Night / Morning / Afternoon / Evening).
Figure: cyclic encoding maps hour‑of‑day onto the unit circle (sin/cos), preserving continuity between hour 23 and hour 0.
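The bucket assignment can be sketched with pd.cut over the raw hour. The bucket edges follow the report; the column name time_of_day is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'incident_hour_of_the_day': [2, 7, 13, 22]})

df['time_of_day'] = pd.cut(
    df['incident_hour_of_the_day'],
    bins=[-1, 5, 11, 17, 23],   # Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
)
```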

4. High‑Cardinality & Quasi‑Identifier Management

4.1 Column Removal Strategy

  • Dropped identifiers/quasi‑identifiers: policy_number, incident_location, insured_zip.
  • Removed raw date columns post‑feature‑extraction.

4.2 Cardinality Analysis

Retained genuinely grouped categoricals with reasonable uniqueness ratios: auto_model (~39 cats), insured_hobbies (~20 cats), auto_make, insured_occupation (~14 cats each).
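The uniqueness ratio behind this decision is simple to compute; a sketch with a toy frame (the 0.9 cutoff is an illustrative threshold, not necessarily the one used here):

```python
import pandas as pd

df = pd.DataFrame({
    'incident_location': [f'{i} Elm St' for i in range(6)],          # near-unique
    'auto_make': ['Honda', 'Honda', 'Toyota', 'Ford', 'Ford', 'Toyota'],
})

ratios = df.nunique() / len(df)                 # uniqueness ratio per column
quasi_ids = ratios[ratios > 0.9].index.tolist() # candidates for removal
```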

5. Missing Value Treatment

5.1 Strategy

Treated missingness as potentially informative (common in fraud settings). Key columns with notable absence: collision_type, authorities_contacted, property_damage, police_report_available.

5.2 Two‑Pronged Approach

  • Missing flags: binary _missing indicators.
  • Explicit category: filled NaN with "Missing" to ensure encoder compatibility.
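Both prongs in one pass, sketched for a single column (the `_missing` suffix follows the report; the toy data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'police_report_available': ['Yes', np.nan, 'No']})

for col in ['police_report_available']:
    df[f'{col}_missing'] = df[col].isna().astype('int8')  # binary missingness flag
    df[col] = df[col].fillna('Missing')                   # explicit category for encoders
```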

6. Final Dataset Composition

6.1 Model‑Ready Feature Set

  • Total features: ~38 columns (net of drops + engineering)
  • Numerical: ~20 (incl. engineered)
  • Categorical: ~18 (incl. time buckets)
  • Target: binary target (fraud indicator)
Figure: final mix of numerical, categorical, and engineered features.

6.2 Data Integrity Checks

  • Validated target ∈ {0,1}.
  • Confirmed completeness for total_claim_amount, share features, days_since_bind, and temporal encodings.
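These checks reduce to a few assertions; a minimal sketch over a toy frame (the real run covers all listed columns):

```python
import pandas as pd

df = pd.DataFrame({
    'target': [0, 1, 0],
    'total_claim_amount': [5000.0, 12000.0, 800.0],
    'days_since_bind': [120, 45, 300],
})

assert df['target'].isin([0, 1]).all(), 'target must be binary'
for col in ['total_claim_amount', 'days_since_bind']:
    assert df[col].notna().all(), f'{col} has missing values'
```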

6.3 Memory Optimization

  • Down‑cast to compact numerics (int8, int32) where safe.
  • Kept object dtype only for true categoricals.
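Down‑casting can be sketched with pandas' built‑in helper; which columns were down‑cast to which widths is an implementation detail, so the frame below is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'months_as_customer': [12, 240, 480],
    'incident_weekend': [0, 1, 0],
})

df['incident_weekend'] = df['incident_weekend'].astype('int8')          # binary flag
df['months_as_customer'] = pd.to_numeric(df['months_as_customer'],
                                         downcast='integer')            # smallest safe int
```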

7. Key Preprocessing Decisions & Rationale

7.1 Feature Selection Philosophy

  • Removed: identifiers, near‑constant, and redundant columns.
  • Retained: genuine categorical groupings with interpretable cardinality.
  • Enhanced: temporal intelligence, claim composition, and vehicle age.

7.2 Missing Data Strategy

  • Captured absence explicitly (flags + category).
  • Reflected domain reality that missingness itself can be a fraud signal.

7.3 Engineering Priorities

  • Interpretability: human‑readable buckets alongside mathematical transforms.
  • Signal preservation: ratios + totals to keep scale and composition.
  • Temporal patterns: duration and cyclic encodings.

Appendix: Feature Dictionary

8. Next Steps for Modeling

  1. Categorical encoding: One‑Hot / Target / Frequency (evaluate per model family).
  2. Scaling: standardize numerics for linear models (within pipeline only).
  3. Train‑test split: stratified by target to respect class imbalance.
  4. Metrics focus: precision‑recall curves, PR‑AUC, and F2 for recall‑weighted scenarios.
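Step 3 above can be sketched with scikit‑learn's stratified split (a library assumption; the report does not mandate a toolkit). The synthetic labels mimic the ~25% positive rate of an imbalanced fraud target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 25 + [0] * 75)   # illustrative 25% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# stratify=y preserves the class ratio in both splits
```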