Executive Summary

The dataset was cleaned, standardized, and enriched with domain‑aligned features (temporal structure, claim composition, vehicle age, and time‑of‑day cycles). We removed identifiers and artifacts, encoded missingness, and validated the integrity of critical targets and features. The result is a compact, information‑dense table designed for downstream fraud modeling.

Figure: target class balance after preprocessing (class imbalance matters for fraud).

1. Data Loading & Initial Assessment

1.1 Dataset Characteristics

  • Original Size: 1,000 rows × 40 columns
  • Memory Usage: ~1.28 MB (in‑memory friendly)
  • Data Types: mixed numerical (≈17) and categorical (≈21)
  • Target Variable: fraud_reported (Y/N → binary indicator)

1.2 Initial Data Quality Issues

  • Empty CSV artifact column (_c39) with 100% missing values
  • Placeholder tokens ("?") in categorical fields
  • Mixed casing/whitespace in text columns
  • Negative values in umbrella_limit (domain violation)
  • High‑cardinality quasi‑identifiers (e.g., near‑unique ZIPs/locations)
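The issues above can be surfaced with a quick audit pass before any cleaning. A minimal sketch, assuming the column names from this report; the `audit_quality` helper itself is illustrative, not part of the pipeline:

```python
import pandas as pd

def audit_quality(df: pd.DataFrame) -> dict:
    """Summarize common raw-data issues prior to cleaning."""
    return {
        # columns that are 100% missing (CSV export artifacts like _c39)
        'empty_columns': [c for c in df.columns if df[c].isna().all()],
        # "?" placeholder tokens in text columns
        'placeholder_counts': {c: int((df[c] == '?').sum())
                               for c in df.select_dtypes('object')
                               if (df[c] == '?').any()},
        # domain violation: negative umbrella coverage
        'negative_umbrella': int((df['umbrella_limit'] < 0).sum()),
    }

raw = pd.DataFrame({
    '_c39': [None, None],
    'collision_type': ['?', 'Rear Collision'],
    'umbrella_limit': [-1000000, 0],
})
report = audit_quality(raw)
```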

2. Data Cleaning & Quality Assurance

2.1 Structural Cleanup

  • Removed empty column: dropped _c39 (CSV export artifact).
  • Normalized missing values: converted "?" → NaN in collision_type, property_damage, and police_report_available.
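Both structural steps can be sketched in two lines of pandas (toy frame for illustration; the real data has three placeholder columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    '_c39': [np.nan, np.nan],                 # empty export artifact
    'collision_type': ['?', 'Side Collision'],
    'property_damage': ['YES', '?'],
})

df = df.drop(columns=['_c39'])                # remove CSV export artifact
placeholder_cols = ['collision_type', 'property_damage']
df[placeholder_cols] = df[placeholder_cols].replace('?', np.nan)
```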

2.2 Text Standardization

  • Trimmed leading/trailing whitespace across all text columns.
  • Normalized case for categorical values (consistent tokens).
  • Audited other missing representations (NA/NULL/None).
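A minimal sketch of the standardization pass; title-casing is one reasonable normalization choice (the report does not specify which casing convention was used):

```python
import pandas as pd

df = pd.DataFrame({'auto_make': ['  Honda ', 'honda', 'TOYOTA']})

# trim whitespace and normalize casing across all text columns
for col in df.select_dtypes('object'):
    df[col] = df[col].str.strip().str.title()

# audit alternative missing-value representations
suspect_tokens = {'NA', 'NULL', 'None', ''}
hits = {c: int(df[c].isin(suspect_tokens).sum())
        for c in df.select_dtypes('object')}
```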

2.3 Domain Logic Enforcement

  • Umbrella coverage: fixed negatives via absolute value on umbrella_limit.
  • Target encoding: mapped fraud_reported Y/N → {1,0} with compact integer type.

3. Feature Engineering

3.1 Temporal Features

  • Date parsing: converted policy_bind_date and incident_date to datetime.
  • Policy duration: days_since_bind (incident date − bind date).
  • Calendar signals: incident_month, incident_dow (0=Mon), incident_weekend (binary).
  • Removed constant: dropped incident_year (all 2015).
# Core transforms (excerpt)
    import numpy as np
    import pandas as pd

    df['umbrella_limit'] = df['umbrella_limit'].abs()  # domain fix: negatives invalid
    df['target'] = df['fraud_reported'].map({'Y': 1, 'N': 0}).astype('int8')

    # Temporal
    df['policy_bind_date'] = pd.to_datetime(df['policy_bind_date'])
    df['incident_date'] = pd.to_datetime(df['incident_date'])
    df['days_since_bind'] = (df['incident_date'] - df['policy_bind_date']).dt.days
    df['incident_month'] = df['incident_date'].dt.month
    df['incident_dow'] = df['incident_date'].dt.weekday  # 0 = Monday
    df['incident_weekend'] = (df['incident_dow'] >= 5).astype('int8')

    # Claim composition (guard against division by zero)
    total = df['total_claim_amount'].replace(0, np.nan)
    df['injury_share']   = df['injury_claim']   / total
    df['property_share'] = df['property_claim'] / total

    # Vehicle age & cyclic hour-of-day encoding
    df['vehicle_age'] = 2015 - df['auto_year']  # all incidents occurred in 2015
    h = df['incident_hour_of_the_day']
    df['hour_sin'] = np.sin(2 * np.pi * h / 24)
    df['hour_cos'] = np.cos(2 * np.pi * h / 24)

3.2 Claim Amount Composition

Explored multiple representations (component totals vs. shares). Selected total + component shares to retain scale while capturing composition:

  • injury_share = injury_claim / total_claim_amount
  • property_share = property_claim / total_claim_amount
  • Protected against division by zero (NaN handling).

3.3 Vehicle Information

  • Vehicle age: vehicle_age = 2015 − auto_year.
  • Dropped raw auto_year after derivation to avoid redundancy.

3.4 Time‑of‑Day Cyclic Encoding

  • hour_sin = sin(2π × hour/24), hour_cos = cos(2π × hour/24).
  • Added 4 interpretation‑friendly buckets: Night (0–5), Morning (6–11), Afternoon (12–17), Evening (18–23).
Figure: incident counts by time‑of‑day bucket (Night / Morning / Afternoon / Evening).
Figure: cyclic encoding maps hour‑of‑day onto the unit circle (sin/cos), preserving continuity between hour 23 and hour 0.
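The bucket assignment can be sketched with pd.cut over the raw hour. The bucket edges follow the report; the column name time_of_day is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'incident_hour_of_the_day': [2, 7, 13, 22]})

df['time_of_day'] = pd.cut(
    df['incident_hour_of_the_day'],
    bins=[-1, 5, 11, 17, 23],   # Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
)
```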

4. High‑Cardinality & Quasi‑Identifier Management

4.1 Column Removal Strategy

  • Dropped identifiers/quasi‑identifiers: policy_number, incident_location, insured_zip.
  • Removed raw date columns post‑feature‑extraction.

4.2 Cardinality Analysis

Retained genuinely grouped categoricals with reasonable uniqueness ratios: auto_model (~39 cats), insured_hobbies (~20 cats), auto_make, insured_occupation (~14 cats each).
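The uniqueness ratio behind this decision is simple to compute; a sketch with a toy frame (the 0.9 cutoff is an illustrative threshold, not necessarily the one used here):

```python
import pandas as pd

df = pd.DataFrame({
    'incident_location': [f'{i} Elm St' for i in range(6)],          # near-unique
    'auto_make': ['Honda', 'Honda', 'Toyota', 'Ford', 'Ford', 'Toyota'],
})

ratios = df.nunique() / len(df)                 # uniqueness ratio per column
quasi_ids = ratios[ratios > 0.9].index.tolist() # candidates for removal
```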

5. Missing Value Treatment

5.1 Strategy

Treated missingness as potentially informative (common in fraud settings). Key columns with notable absence: collision_type, authorities_contacted, property_damage, police_report_available.

5.2 Two‑Pronged Approach

  • Missing flags: binary _missing indicators.
  • Explicit category: filled NaN with "Missing" to ensure encoder compatibility.
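Both prongs in one pass, sketched for a single column (the `_missing` suffix follows the report; the toy data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'police_report_available': ['Yes', np.nan, 'No']})

for col in ['police_report_available']:
    df[f'{col}_missing'] = df[col].isna().astype('int8')  # binary missingness flag
    df[col] = df[col].fillna('Missing')                   # explicit category for encoders
```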

6. Final Dataset Composition

6.1 Model‑Ready Feature Set

  • Total features: ~38 columns (net of drops + engineering)
  • Numerical: ~20 (incl. engineered)
  • Categorical: ~18 (incl. time buckets)
  • Target: binary target (fraud indicator)
Figure: final mix of numerical, categorical, and engineered features.

6.2 Data Integrity Checks

  • Validated target ∈ {0,1}.
  • Confirmed completeness for total_claim_amount, share features, days_since_bind, and temporal encodings.
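These checks reduce to a few assertions; a minimal sketch over a toy frame (the real run covers all listed columns):

```python
import pandas as pd

df = pd.DataFrame({
    'target': [0, 1, 0],
    'total_claim_amount': [5000.0, 12000.0, 800.0],
    'days_since_bind': [120, 45, 300],
})

assert df['target'].isin([0, 1]).all(), 'target must be binary'
for col in ['total_claim_amount', 'days_since_bind']:
    assert df[col].notna().all(), f'{col} has missing values'
```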

6.3 Memory Optimization

  • Down‑cast to compact numerics (int8, int32) where safe.
  • Kept object dtype only for true categoricals.
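Down‑casting can be sketched with pandas' built‑in helper; which columns were down‑cast to which widths is an implementation detail, so the frame below is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'months_as_customer': [12, 240, 480],
    'incident_weekend': [0, 1, 0],
})

df['incident_weekend'] = df['incident_weekend'].astype('int8')          # binary flag
df['months_as_customer'] = pd.to_numeric(df['months_as_customer'],
                                         downcast='integer')            # smallest safe int
```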

7. Key Preprocessing Decisions & Rationale

7.1 Feature Selection Philosophy

  • Removed: identifiers, near‑constant, and redundant columns.
  • Retained: genuine categorical groupings with interpretable cardinality.
  • Enhanced: temporal intelligence, claim composition, and vehicle age.

7.2 Missing Data Strategy

  • Captured absence explicitly (flags + category).
  • Reflected domain reality that missingness itself can be a fraud signal.

7.3 Engineering Priorities

  • Interpretability: human‑readable buckets alongside mathematical transforms.
  • Signal preservation: ratios + totals to keep scale and composition.
  • Temporal patterns: duration and cyclic encodings.

Appendix: Feature Dictionary

8. Next Steps for Modeling

  1. Categorical encoding: One‑Hot / Target / Frequency (evaluate per model family).
  2. Scaling: standardize numerics for linear models (within pipeline only).
  3. Train‑test split: stratified by target to respect class imbalance.
  4. Metrics focus: precision‑recall curves, PR‑AUC, and F2 for recall‑weighted scenarios.
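Step 3 above can be sketched with scikit‑learn's stratified split (a library assumption; the report does not mandate a toolkit). The synthetic labels mimic the ~25% positive rate of an imbalanced fraud target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 25 + [0] * 75)   # illustrative 25% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# stratify=y preserves the class ratio in both splits
```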