Purpose

This report establishes what the dataset actually represents, how it was constructed, and which structural risks must be acknowledged before any modeling decisions are made.

The goal is not insight generation, but data realism and constraint discovery.

1. Dataset Origin & Scope

The dataset used in this project is insurance_claims.csv, a public dataset published on Mendeley Data (August 22, 2023, Version 2, DOI 10.17632/992mh7dk9y.2) and aggregated from multiple insurance providers.

Original dataset: https://doi.org/10.17632/992mh7dk9y.2

Each row represents an individual insurance claim across vehicle, property, and personal injury domains. Sensitive personal identifiers were anonymized or masked, preserving structural realism while protecting privacy.

The dataset reflects post-hoc claim assessments, meaning fraud labels represent institutional decisions rather than objective ground truth.

2. Dataset Size, Schema & Target Distribution

  • Raw size: 1,000 rows × 40 columns
  • Cleaned size: 999 rows (1 invalid temporal record removed)
  • Target variable: fraud_reported (Y/N)

Class Imbalance

  • 24.7 % fraudulent claims
  • 75.3 % legitimate claims
  • Approximately a 3 : 1 legitimate-to-fraud ratio

A chi-square test confirms the imbalance is statistically significant (p < 0.001).
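
A minimal sketch of this check follows. It assumes the raw CSV is available locally; testing against a 50/50 balanced baseline is an assumption about how the test was specified.

```python
# Minimal sketch of the imbalance check. Assumes the raw CSV is local;
# the 50/50 expected baseline for the goodness-of-fit test is an assumption.
import pandas as pd
from scipy.stats import chisquare

df = pd.read_csv("insurance_claims.csv")
counts = df["fraud_reported"].value_counts()
print(counts / len(df))  # expect roughly N: 0.753, Y: 0.247

# With f_exp omitted, chisquare tests against a uniform (balanced) split.
stat, p = chisquare(counts.values)
print(f"chi2 = {stat:.1f}, p = {p:.2e}")
```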

Implication:
Accuracy is misleading at this imbalance. Evaluation must be precision-recall oriented, prioritizing fraud recall while keeping false positives within operational limits.
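
A sketch of what that evaluation could look like; the model, feature set, and precision floor below are illustrative placeholders, not the project's choices.

```python
# Sketch of precision-recall-oriented evaluation; assumes the cleaned
# frame `df`. Model, features, and the 0.5 precision floor are placeholders.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

y = (df["fraud_reported"] == "Y").astype(int)
X = df.select_dtypes("number")  # placeholder feature set

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
scores = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("average precision:", average_precision_score(y_te, scores))

# Maximize fraud recall subject to a precision floor, i.e. catch fraud
# while keeping false positives within a budget.
precision, recall, thresholds = precision_recall_curve(y_te, scores)
ok = precision[:-1] >= 0.5
print("threshold:", thresholds[ok][np.argmax(recall[:-1][ok])] if ok.any() else None)
```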

3. Feature Families

Features fall into four broad categories:

  • Customer attributes
    Age, education, occupation, hobbies
  • Policy details
    Deductibles, CSL limits, annual premiums, policy tenure, state
  • Incident characteristics
    Collision type, severity, authorities contacted, police report availability
  • Financial variables
    Total claim amount and sub-components (vehicle, property, injury)

This mix introduces heterogeneous scales, cardinalities, and semantic meanings, requiring disciplined preprocessing.
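
A preprocessing skeleton along these lines keeps the families separated; the column lists below are illustrative, not the full schema.

```python
# Sketch of family-aware preprocessing; column lists are illustrative.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "policy_annual_premium", "total_claim_amount"]
categorical_cols = ["insured_occupation", "collision_type", "incident_severity"]

preprocess = ColumnTransformer(
    transformers=[
        # Scale numeric features so premiums and ages share a range.
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode categoricals; ignore unseen levels at predict time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",
)
```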

4. Missing Values & Structural Patterns

Several categorical variables contain missing values, recorded as the sentinel "?":

  • collision_type
  • property_damage
  • police_report_available
  • authorities_contacted

Missingness is non-random and disproportionately associated with fraudulent claims (≈ 18 % of claims are missing collision data overall).

Interpretation:
Incomplete documentation is not noise — it reflects ambiguity or avoidance and must be preserved as signal rather than imputed away.
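
One way to preserve that signal, sketched below assuming the cleaned frame df: promote the sentinel to an explicit category, then verify its association with the label.

```python
# Sketch: keep "?" as an explicit category instead of imputing it away.
# Assumes the cleaned frame `df`; "UNKNOWN" is an arbitrary sentinel name.
import pandas as pd

flagged = ["collision_type", "property_damage",
           "police_report_available", "authorities_contacted"]
for col in flagged:
    df[col] = df[col].replace("?", "UNKNOWN").fillna("UNKNOWN")

# Sanity check: does absence co-vary with the label?
print(pd.crosstab(df["collision_type"] == "UNKNOWN",
                  df["fraud_reported"], normalize="index"))
```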

5. Label Uncertainty

The fraud_reported label reflects outcomes of manual investigation combined with automated checks, not objective truth.

Consequences:

  • Labels may encode institutional bias
  • Some fraud-relevant information may have been unavailable at claim submission
  • Certain features risk indirectly encoding investigation outcomes

This necessitates explicit leakage control and conservative interpretation of model outputs.
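
The report does not prescribe an audit procedure; as one heuristic, a coarse association screen can surface features that predict the label suspiciously well.

```python
# Heuristic leakage screen (a sketch, not the project's actual procedure):
# rank features by mutual information with the label and review the top ones.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = df.drop(columns=["fraud_reported"])
y = (df["fraud_reported"] == "Y").astype(int)

# Ordinal-encode everything just for this screen (not for modeling).
X_enc = X.apply(lambda s: pd.factorize(s)[0])

mi = pd.Series(mutual_info_classif(X_enc, y, discrete_features=True),
               index=X_enc.columns).sort_values(ascending=False)
print(mi.head(10))  # near-deterministic features warrant leakage review
```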

6. Obvious Red Flags Identified Early

Initial auditing revealed several structural risks:

  • Empty column: _c39 (dropped)
  • Unique identifiers: policy_number, incident_location (dropped)
  • Near-identifiers: insured_zip (≈ 1 : 1 with city/state)
  • Temporal inconsistencies: one policy bind date falling after its incident date
  • Financial anomalies: negative umbrella limit values

All were addressed before exploration to prevent silent leakage.
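
A sketch of that cleanup follows; the date and umbrella-limit column names are assumed from the dataset's published schema.

```python
# Sketch of the pre-exploration cleanup. Column names policy_bind_date,
# incident_date, and umbrella_limit are assumed from the published schema.
import pandas as pd

df = pd.read_csv("insurance_claims.csv",
                 parse_dates=["policy_bind_date", "incident_date"])

# Drop the empty column and the unique identifiers.
df = df.drop(columns=["_c39", "policy_number", "incident_location"])

# Remove the record whose policy was bound after its incident.
df = df[df["policy_bind_date"] <= df["incident_date"]]

# Flag, rather than silently overwrite, negative umbrella limits.
df["umbrella_limit_negative"] = df["umbrella_limit"] < 0
```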

7. Visual Evidence (Minimal & Targeted)

Only two visuals are included in this report:

  • Target imbalance bar chart — demonstrates minority fraud class
  • Missingness heatmap (optional) — highlights non-random absence patterns
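
The imbalance chart needs nothing beyond a value-counts plot; a minimal sketch, assuming the cleaned frame df:

```python
# Minimal sketch of the target-imbalance bar chart; assumes frame `df`.
import matplotlib.pyplot as plt

df["fraud_reported"].value_counts().plot(kind="bar")
plt.title("fraud_reported class distribution (n = 999)")
plt.ylabel("claims")
plt.tight_layout()
plt.show()
```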

All other exploratory visuals are intentionally deferred to subsequent reports.

Key Takeaway

This dataset is:

  • Structurally rich
  • Moderately imbalanced
  • Semantically heterogeneous
  • Label-uncertain
  • Leakage-prone if handled naively

Understanding these constraints is mandatory before attempting feature engineering or modeling. This report defines the ground truth of the data, not the ambition of the model.
