Purpose
This report establishes what the dataset actually represents, how it was constructed, and which structural risks must be acknowledged before any modeling decisions are made.
The goal is not insight generation, but data realism and constraint discovery.
1. Dataset Origin & Scope
The dataset used in this project is insurance_claims.csv, a public dataset published on Mendeley Data (August 22, 2023, Version 2, DOI: 10.17632/992mh7dk9y.2) and aggregated from multiple insurance providers.
Each row represents an individual insurance claim across vehicle, property, and personal injury domains. Sensitive personal identifiers were anonymized or masked, preserving structural realism while protecting privacy.
The dataset reflects post-hoc claim assessments, meaning fraud labels represent institutional decisions rather than objective ground truth.
2. Dataset Size, Schema & Target Distribution
- Raw size: 1,000 rows × 40 columns
- Cleaned size: 999 rows (1 invalid temporal record removed)
- Target variable: fraud_reported (Y/N)
Class Imbalance
- 24.7 % fraudulent claims
- 75.3 % legitimate claims
- Approximately a 3 : 1 legitimate-to-fraud ratio
A chi-square test confirms the imbalance is statistically significant (p < 0.001).
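As a reproducibility note, the sketch below shows one way such a check can be run, assuming pandas, SciPy, and a 50/50 null hypothesis; the exact null used in the original test is an assumption here.

```python
# Minimal sketch of the imbalance check. The 50/50 null is an assumption;
# column and file names follow the published schema.
import pandas as pd
from scipy.stats import chisquare

df = pd.read_csv("insurance_claims.csv")
observed = df["fraud_reported"].value_counts()  # roughly N: 752, Y: 247

stat, p_value = chisquare(observed)  # default null: equal expected counts
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")  # p falls well below 0.001
```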
Accuracy is therefore misleading. Evaluation must be precision–recall oriented, prioritizing fraud recall under a constrained false-positive budget, as the baseline sketch below illustrates.
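To make the accuracy point concrete, a hedged illustration (not part of the original analysis): a baseline that always predicts "legitimate" reaches roughly 75 % accuracy while recovering zero fraud.

```python
# Illustrative majority-class baseline: high accuracy, zero fraud recall.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

df = pd.read_csv("insurance_claims.csv")
y_true = (df["fraud_reported"] == "Y").astype(int).to_numpy()
y_pred = np.zeros_like(y_true)  # always predict the legitimate class

print("accuracy:", accuracy_score(y_true, y_pred))                     # ~0.753
print("fraud recall:", recall_score(y_true, y_pred, zero_division=0))  # 0.0
```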
3. Feature Families
Features fall into four broad categories:
- Customer attributes: age, education, occupation, hobbies
- Policy details: deductibles, CSL limits, annual premiums, policy tenure, state
- Incident characteristics: collision type, severity, authorities contacted, police report availability
- Financial variables: total claim amount and its sub-components (vehicle, property, injury)
This mix introduces heterogeneous scales, cardinalities, and semantic meanings, requiring disciplined preprocessing; a minimal pipeline sketch follows.
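One common way to impose that discipline is a column-wise pipeline. The sketch below assumes scikit-learn, and the column groupings are illustrative rather than an exhaustive schema map.

```python
# Hedged sketch: family-aware preprocessing. Scale numeric financials,
# one-hot encode categoricals; groupings here are illustrative only.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "policy_annual_premium", "total_claim_amount",
                "injury_claim", "property_claim", "vehicle_claim"]
categorical_cols = ["insured_education_level", "insured_occupation",
                    "insured_hobbies", "policy_state", "collision_type",
                    "incident_severity", "authorities_contacted"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="drop",  # anything unlisted stays out until audited
)
```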
4. Missing Values & Structural Patterns
Several categorical variables contain missing or "?" values:
collision_type, property_damage, police_report_available, authorities_contacted
Missingness is non-random and disproportionately associated with fraudulent claims; collision data alone is missing in ≈ 18 % of records.
Interpretation:
Incomplete documentation is not noise; it reflects ambiguity or avoidance, and it must be preserved as signal rather than imputed away. A minimal sketch of this principle follows.
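The sketch assumes pandas and the "?" sentinel noted above; the explicit "Missing" category plus indicator flag is one common pattern, not the only valid one.

```python
# Preserve missingness as signal: replace "?" with an explicit category
# and an indicator flag instead of imputing a plausible value.
import numpy as np
import pandas as pd

df = pd.read_csv("insurance_claims.csv")
ambiguous = ["collision_type", "property_damage",
             "police_report_available", "authorities_contacted"]

df[ambiguous] = df[ambiguous].replace("?", np.nan)
for col in ambiguous:
    df[col + "_missing"] = df[col].isna().astype(int)  # keep the flag
    df[col] = df[col].fillna("Missing")                # gap becomes a category
```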
5. Label Uncertainty
The fraud_reported label reflects outcomes of manual investigation combined with automated checks, not objective truth.
Consequences:
- Labels may encode institutional bias
- Some fraud-relevant information may have been unavailable at claim submission
- Certain features risk indirectly encoding investigation outcomes
This necessitates explicit leakage control and conservative interpretation of model outputs; one illustrative control pattern is sketched below.
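The pattern below uses placeholder entries rather than audited column names from this dataset.

```python
# Hypothetical deny-list pattern for explicit leakage control: exclude any
# field not knowable at claim submission. Names besides the target are
# placeholders for illustration.
import pandas as pd

df = pd.read_csv("insurance_claims.csv")

DENY_LIST = {
    "fraud_reported",        # the target itself
    # "investigator_notes",  # example: a field populated only after review
}
X = df[[c for c in df.columns if c not in DENY_LIST]]
y = df["fraud_reported"]
```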
6. Obvious Red Flags Identified Early
Initial auditing revealed several structural risks:
- Empty column: _c39 (dropped)
- Unique identifiers: policy_number, incident_location (dropped)
- Near-identifiers: insured_zip (≈ 1 : 1 with city/state)
- Temporal inconsistencies: one policy bind date occurring after its incident date
- Financial anomalies: negative umbrella limit values
All were addressed before exploration to prevent silent leakage; a condensed version of this cleaning pass is sketched below.
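The sketch uses the published column names; the handling of negative umbrella limits shown here (flag, then take absolute values) is an assumption, since the report states only that they were addressed.

```python
# Condensed cleaning pass. Column names follow the published schema; the
# negative-umbrella-limit repair is an assumed strategy.
import pandas as pd

df = pd.read_csv("insurance_claims.csv")
df = df.drop(columns=["_c39", "policy_number", "incident_location"])
# insured_zip is flagged as a near-identifier but retained pending review.

# Remove the single record whose policy bind date postdates its incident.
bind = pd.to_datetime(df["policy_bind_date"])
incident = pd.to_datetime(df["incident_date"])
df = df[bind <= incident]  # 1,000 -> 999 rows

# Flag, then repair, negative umbrella limits (assumed sign-entry errors).
df["umbrella_limit_was_negative"] = (df["umbrella_limit"] < 0).astype(int)
df["umbrella_limit"] = df["umbrella_limit"].abs()
```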
7. Visual Evidence (Minimal & Targeted)
Included visuals for this report only:
- Target imbalance bar chart: demonstrates the minority fraud class
- Missingness heatmap (optional): highlights non-random absence patterns
All other exploratory visuals are intentionally deferred to subsequent reports.
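For completeness, a minimal matplotlib sketch of the one required visual; the optional heatmap would typically come from a library such as missingno or seaborn and is omitted here.

```python
# Minimal target-imbalance bar chart, assuming matplotlib and pandas.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("insurance_claims.csv")
counts = df["fraud_reported"].value_counts()

counts.plot(kind="bar", rot=0)
plt.title("fraud_reported distribution (N = legitimate, Y = fraud)")
plt.xlabel("fraud_reported")
plt.ylabel("number of claims")
plt.tight_layout()
plt.show()
```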
Key Takeaway
This dataset is:
- Structurally rich
- Moderately imbalanced
- Semantically heterogeneous
- Label-uncertain
- Leakage-prone if handled naively
Understanding these constraints is mandatory before attempting feature engineering or modeling. This report defines the ground truth of the data, not the ambition of the model.