Purpose
This report establishes what the dataset actually represents, how it was constructed, and which structural risks must be acknowledged before any modeling decisions are made.
The goal is not insight generation, but data realism and constraint discovery.
1. Dataset Origin & Scope
The dataset used in this project is insurance_claims.csv, a public dataset published on Mendeley Data (August 22, 2023, Version 2, DOI: 10.17632/992mh7dk9y.2) and aggregated from multiple insurance providers.
Each row represents an individual insurance claim across vehicle, property, and personal injury domains. Sensitive personal identifiers were anonymized or masked, preserving structural realism while protecting privacy.
The dataset reflects post-hoc claim assessments, meaning fraud labels represent institutional decisions rather than objective ground truth.
2. Dataset Size, Schema & Target Distribution
- Raw size: 1,000 rows × 40 columns
- Cleaned size: 999 rows (1 invalid temporal record removed)
- Target variable: fraud_reported (Y/N)
Class Imbalance
- 24.7 % fraudulent claims
- 75.3 % legitimate claims
- Approximately a 3 : 1 legitimate-to-fraud ratio
A chi-square test confirms the imbalance is statistically significant (p < 0.001).
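As a reproducibility note, the sketch below shows one way such a check can be run, assuming pandas, SciPy, and a 50/50 null hypothesis; the exact null used in the original test is an assumption here.

```python
# Minimal sketch of the imbalance check. The 50/50 null is an assumption;
# column and file names follow the published schema.
import pandas as pd
from scipy.stats import chisquare

df = pd.read_csv("insurance_claims.csv")
observed = df["fraud_reported"].value_counts()  # roughly N: 752, Y: 247

stat, p_value = chisquare(observed)  # default null: equal expected counts
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")  # p falls well below 0.001
```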
Accuracy is therefore misleading. Evaluation must be precision–recall oriented, prioritizing fraud recall under a constrained false-positive budget, as the baseline sketch below illustrates.
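To make the accuracy point concrete, a hedged illustration (not part of the original analysis): a baseline that always predicts "legitimate" reaches roughly 75 % accuracy while recovering zero fraud.

```python
# Illustrative majority-class baseline: high accuracy, zero fraud recall.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

df = pd.read_csv("insurance_claims.csv")
y_true = (df["fraud_reported"] == "Y").astype(int).to_numpy()
y_pred = np.zeros_like(y_true)  # always predict the legitimate class

print("accuracy:", accuracy_score(y_true, y_pred))                     # ~0.753
print("fraud recall:", recall_score(y_true, y_pred, zero_division=0))  # 0.0
```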
3. Feature Families
Features fall into four broad categories:
- Customer attributes: age, education, occupation, hobbies
- Policy details: deductibles, CSL limits, annual premiums, policy tenure, state
- Incident characteristics: collision type, severity, authorities contacted, police report availability
- Financial variables: total claim amount and its sub-components (vehicle, property, injury)
This mix introduces heterogeneous scales, cardinalities, and semantic meanings, requiring disciplined preprocessing; a minimal pipeline sketch follows.
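One common way to impose that discipline is a column-wise pipeline. The sketch below assumes scikit-learn, and the column groupings are illustrative rather than an exhaustive schema map.

```python
# Hedged sketch: family-aware preprocessing. Scale numeric financials,
# one-hot encode categoricals; groupings here are illustrative only.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "policy_annual_premium", "total_claim_amount",
                "injury_claim", "property_claim", "vehicle_claim"]
categorical_cols = ["insured_education_level", "insured_occupation",
                    "insured_hobbies", "policy_state", "collision_type",
                    "incident_severity", "authorities_contacted"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",  # anything unlisted stays out until audited
)
```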
4. Missing Values & Structural Patterns
Several categorical variables contain missing or "?" values:
collision_type, property_damage, police_report_available, authorities_contacted
Missingness is non-random and disproportionately associated with fraudulent claims; collision data alone is missing in ≈ 18 % of records.
Interpretation:
Incomplete documentation is not noise; it reflects ambiguity or avoidance, and it must be preserved as signal rather than imputed away. A minimal sketch of this principle follows.
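The sketch assumes pandas and the "?" sentinel noted above; the explicit "Missing" category plus indicator flag is one common pattern, not the only valid one.

```python
# Preserve missingness as signal: replace "?" with an explicit category
# and an indicator flag instead of imputing a plausible value.
import numpy as np
import pandas as pd

df = pd.read_csv("insurance_claims.csv")
ambiguous = ["collision_type", "property_damage",
             "police_report_available", "authorities_contacted"]

df[ambiguous] = df[ambiguous].replace("?", np.nan)
for col in ambiguous:
    df[col + "_missing"] = df[col].isna().astype(int)  # keep the flag
    df[col] = df[col].fillna("Missing")                # gap becomes a category
```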
5. Label Uncertainty
The fraud_reported label reflects outcomes of manual investigation combined with automated checks, not objective truth.
Consequences:
- Labels may encode institutional bias
- Some fraud-relevant information may have been unavailable at claim submission
- Certain features risk indirectly encoding investigation outcomes
This necessitates explicit leakage control and conservative interpretation of model outputs; one illustrative control pattern is sketched below.
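The pattern below uses placeholder entries rather than audited column names from this dataset.

```python
# Hypothetical deny-list pattern for explicit leakage control: exclude any
# field not knowable at claim submission. Names besides the target are
# placeholders for illustration.
import pandas as pd

df = pd.read_csv("insurance_claims.csv")

DENY_LIST = {
    "fraud_reported",        # the target itself
    # "investigator_notes",  # example: a field populated only after review
}
X = df[[c for c in df.columns if c not in DENY_LIST]]
y = df["fraud_reported"]
```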
6. Obvious Red Flags Identified Early
Initial auditing revealed several structural risks:
- Empty column: _c39 (dropped)
- Unique identifiers: policy_number, incident_location (dropped)
- Near-identifiers: insured_zip (≈ 1 : 1 with city/state)
- Temporal inconsistencies: one policy bind date occurring after its incident date
- Financial anomalies: negative umbrella limit values
All were addressed before exploration to prevent silent leakage; a condensed version of this cleaning pass is sketched below.
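The sketch uses the published column names; the handling of negative umbrella limits shown here (flag, then take absolute values) is an assumption, since the report states only that they were addressed.

```python
# Condensed cleaning pass. Column names follow the published schema; the
# negative-umbrella-limit repair is an assumed strategy.
import pandas as pd

df = pd.read_csv("insurance_claims.csv")
df = df.drop(columns=["_c39", "policy_number", "incident_location"])
# insured_zip is flagged as a near-identifier but retained pending review.

# Remove the single record whose policy bind date postdates its incident.
bind = pd.to_datetime(df["policy_bind_date"])
incident = pd.to_datetime(df["incident_date"])
df = df[bind <= incident]  # 1,000 -> 999 rows

# Flag, then repair, negative umbrella limits (assumed sign-entry errors).
df["umbrella_limit_was_negative"] = (df["umbrella_limit"] < 0).astype(int)
df["umbrella_limit"] = df["umbrella_limit"].abs()
```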
7. Visual Evidence (Minimal & Targeted)
Included visuals for this report only:
- Target imbalance bar chart: demonstrates the minority fraud class
- Missingness heatmap (optional): highlights non-random absence patterns
All other exploratory visuals are intentionally deferred to subsequent reports.
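For completeness, a minimal matplotlib sketch of the one required visual; the optional heatmap would typically come from a library such as missingno or seaborn and is omitted here.

```python
# Minimal target-imbalance bar chart, assuming matplotlib and pandas.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("insurance_claims.csv")
counts = df["fraud_reported"].value_counts()

counts.plot(kind="bar", rot=0)
plt.title("fraud_reported distribution (N = legitimate, Y = fraud)")
plt.xlabel("fraud_reported")
plt.ylabel("number of claims")
plt.tight_layout()
plt.show()
```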
Key Takeaway
This dataset is:
- Structurally rich
- Moderately imbalanced
- Semantically heterogeneous
- Label-uncertain
- Leakage-prone if handled naively
Understanding these constraints is mandatory before attempting feature engineering or modeling. This report defines the ground truth of the data, not the ambition of the model.