Executive Summary
The dataset was cleaned, standardized, and enriched with domain‑aligned features (temporal structure, claim composition, vehicle age, and time‑of‑day cycles). We removed identifiers and artifacts, encoded missingness, and validated the integrity of critical targets and features. The result is a compact, information‑dense table designed for downstream fraud modeling.
1. Data Loading & Initial Assessment
1.1 Dataset Characteristics
- Original Size: 1,000 rows × 40 columns
- Memory Usage: ~1.28 MB (in‑memory friendly)
- Data Types: mixed numerical (≈17) and categorical (≈21)
- Target Variable: fraud_reported (Y/N → binary indicator)
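A minimal loading-and-inspection sketch of the assessment above; the filename insurance_claims.csv is an assumption, since the report does not name the source file.
# Load & first look (sketch; filename is an assumption)
import pandas as pd

df = pd.read_csv('insurance_claims.csv')        # hypothetical filename
print(df.shape)                                 # expect (1000, 40)
print(df.memory_usage(deep=True).sum() / 1e6)   # approximate in-memory size, MB
print(df.dtypes.value_counts())                 # numeric vs. object column counts
print(df['fraud_reported'].value_counts())      # Y/N target distribution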
1.2 Initial Data Quality Issues
- Empty CSV artifact column (_c39) with 100% missing values
- Placeholder tokens ("?") in categorical fields
- Mixed casing/whitespace in text columns
- Negative values in umbrella_limit (domain violation)
- High‑cardinality quasi‑identifiers (e.g., near‑unique ZIPs/locations)
2. Data Cleaning & Quality Assurance
2.1 Structural Cleanup
- Removed empty column: dropped _c39 (CSV export artifact).
- Normalized missing values: converted "?" → NaN in collision_type, property_damage, police_report_available.
2.2 Text Standardization
- Trimmed leading/trailing whitespace across all text columns.
- Normalized case for categorical values (consistent tokens).
- Audited other missing representations (NA/NULL/None).
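A compact sketch of the structural cleanup (2.1) and text standardization (2.2), assuming df is the frame loaded above; lower-casing as the specific normalization and the list of alternate missing tokens are assumptions.
# Structural cleanup & text standardization (sketch)
import numpy as np

df = df.drop(columns=['_c39'])                      # CSV export artifact, 100% missing
for col in ['collision_type', 'property_damage', 'police_report_available']:
    df[col] = df[col].replace('?', np.nan)          # placeholder token → NaN
text_cols = df.select_dtypes(include='object').columns.drop('fraud_reported')  # keep Y/N for the §2.3 mapping
for col in text_cols:
    df[col] = df[col].str.strip().str.lower()       # trim whitespace, normalize case (lower-casing assumed)
    df[col] = df[col].replace(['na', 'null', 'none'], np.nan)   # other missing tokens flagged by the audit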
2.3 Domain Logic Enforcement
- Umbrella coverage: fixed negatives via absolute value on umbrella_limit.
- Target encoding: mapped fraud_reported Y/N → {1, 0} with a compact integer type.
3. Feature Engineering
3.1 Temporal Features
- Date parsing: converted policy_bind_date and incident_date to datetime.
- Policy duration: days_since_bind (incident date − bind date).
- Calendar signals: incident_month, incident_dow (0 = Mon), incident_weekend (binary).
- Removed constant: dropped incident_year (all 2015).
# Core transforms (excerpt)
df['umbrella_limit'] = df['umbrella_limit'].abs() # domain fix
df['target'] = df['fraud_reported'].map({'Y':1,'N':0}).astype('int8')
# Temporal
df['policy_bind_date'] = pd.to_datetime(df['policy_bind_date'])
df['incident_date'] = pd.to_datetime(df['incident_date'])
df['days_since_bind'] = (df['incident_date'] - df['policy_bind_date']).dt.days
df['incident_month'] = df['incident_date'].dt.month
df['incident_dow'] = df['incident_date'].dt.weekday
df['incident_weekend'] = (df['incident_dow'] >= 5).astype('int8')
# Claim composition (zero totals → NaN to guard against division by zero)
total = df['total_claim_amount'].replace(0, np.nan)
df['injury_share'] = df['injury_claim'] / total
df['property_share'] = df['property_claim'] / total
# Vehicle & hour cyclic
df['vehicle_age'] = 2015 - df['auto_year']
h = df['incident_hour_of_the_day']
df['hour_sin'] = np.sin(2*np.pi*h/24)
df['hour_cos'] = np.cos(2*np.pi*h/24)
3.2 Claim Amount Composition
Explored multiple representations (component totals vs. shares). Selected total + component shares to retain scale while capturing composition:
- injury_share = injury_claim / total_claim_amount
- property_share = property_claim / total_claim_amount
- Protected against division by zero (NaN handling).
3.3 Vehicle Information
- Vehicle age: vehicle_age = 2015 − auto_year.
- Dropped raw auto_year after derivation to avoid redundancy.
3.4 Time‑of‑Day Cyclic Encoding
- hour_sin = sin(2π × hour/24), hour_cos = cos(2π × hour/24).
- Added 4 interpretation‑friendly buckets: Night (0–5), Morning (6–11), Afternoon (12–17), Evening (18–23).
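The sin/cos terms appear in the core-transforms excerpt above; a sketch of the bucketing follows (the column name incident_period is an assumption).
# Time-of-day buckets (sketch; column name is an assumption)
df['incident_period'] = pd.cut(
    df['incident_hour_of_the_day'],
    bins=[-1, 5, 11, 17, 23],                            # 0–5, 6–11, 12–17, 18–23
    labels=['Night', 'Morning', 'Afternoon', 'Evening'],
)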
4. High‑Cardinality & Quasi‑Identifier Management
4.1 Column Removal Strategy
- Dropped identifiers/quasi‑identifiers: policy_number, incident_location, insured_zip.
- Removed raw date columns after feature extraction.
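A minimal sketch of these drops:
# Drop quasi-identifiers and spent date columns (sketch)
df = df.drop(columns=['policy_number', 'incident_location', 'insured_zip',   # quasi-identifiers
                      'policy_bind_date', 'incident_date'])                  # raw dates, features already extracted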
4.2 Cardinality Analysis
Retained genuinely grouped categoricals with reasonable uniqueness ratios:
auto_model (~39 categories), insured_hobbies (~20 categories), auto_make and insured_occupation (~14 categories each).
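A sketch of the uniqueness-ratio audit behind this call; the 0.5 flagging threshold is illustrative, not a value taken from the report.
# Cardinality audit (sketch; the 0.5 threshold is illustrative)
cat_cols = df.select_dtypes(include='object').columns
uniqueness = df[cat_cols].nunique() / len(df)
print(uniqueness.sort_values(ascending=False))
near_unique = uniqueness[uniqueness > 0.5].index.tolist()   # candidates for removal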
5. Missing Value Treatment
5.1 Strategy
Treated missingness as potentially informative (common in fraud settings). Key columns with notable absence:
collision_type, authorities_contacted, property_damage, police_report_available.
5.2 Two‑Pronged Approach
- Missing flags: binary _missing indicators.
- Explicit category: filled NaN with "Missing" to ensure encoder compatibility.
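A minimal sketch of the two-pronged treatment, assuming the four columns listed in 5.1:
# Missing-value flags + explicit "Missing" category (sketch)
for col in ['collision_type', 'authorities_contacted',
            'property_damage', 'police_report_available']:
    df[col + '_missing'] = df[col].isna().astype('int8')   # binary absence indicator
    df[col] = df[col].fillna('Missing')                    # explicit category for encoders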
6. Final Dataset Composition
6.1 Model‑Ready Feature Set
- Total features: ~38 columns (net of drops + engineering)
- Numerical: ~20 (incl. engineered)
- Categorical: ~18 (incl. time buckets)
- Target: binary target (fraud indicator)
6.2 Data Integrity Checks
- Validated target ∈ {0, 1}.
- Confirmed completeness for total_claim_amount, share features, days_since_bind, and temporal encodings.
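A sketch of these checks as simple assertions; the column list is assembled from the features named above.
# Integrity checks (sketch)
assert set(df['target'].unique()) <= {0, 1}, 'target must be binary'
key_cols = ['total_claim_amount', 'injury_share', 'property_share',   # shares complete only if no zero totals
            'days_since_bind', 'incident_month', 'incident_dow',
            'incident_weekend', 'hour_sin', 'hour_cos']
assert df[key_cols].notna().all().all(), 'unexpected missing values in key features'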
6.3 Memory Optimization
- Down‑cast to compact numeric types (int8, int32) where safe.
- Kept object dtype only for true categoricals.
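A hedged downcasting sketch; the float downcast is an assumption, since the report only mentions integer types.
# Downcast numerics where safe (sketch)
for col in df.select_dtypes(include='int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')   # int64 → int8/int16/int32 as values allow
for col in df.select_dtypes(include='float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')     # float64 → float32 (assumption)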
7. Key Preprocessing Decisions & Rationale
7.1 Feature Selection Philosophy
- Removed: identifiers, near‑constant, and redundant columns.
- Retained: genuine categorical groupings with interpretable cardinality.
- Enhanced: temporal intelligence, claim composition, and vehicle age.
7.2 Missing Data Strategy
- Captured absence explicitly (flags + category).
- Reflected domain reality that missingness itself can be a fraud signal.
7.3 Engineering Priorities
- Interpretability: human‑readable buckets alongside mathematical transforms.
- Signal preservation: ratios + totals to keep scale and composition.
- Temporal patterns: duration and cyclic encodings.
Appendix: Feature Dictionary
8. Next Steps for Modeling
- Categorical encoding: One‑Hot / Target / Frequency (evaluate per model family).
- Scaling: standardize numerics for linear models (within pipeline only).
- Train‑test split: stratified by target to respect class imbalance.
- Metrics focus: precision‑recall curves, PR‑AUC, and F2 for recall‑weighted scenarios.
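A hedged end-to-end sketch of these next steps using scikit-learn; the one-hot/logistic-regression combination and all parameters are placeholders, not choices made in this report.
# Stratified split, in-pipeline scaling, PR-focused evaluation (sketch; model is a placeholder)
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = df.drop(columns=['target', 'fraud_reported'], errors='ignore')   # drop target and its raw source to avoid leakage
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)                # stratified to respect class imbalance

num_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(include=['object', 'category']).columns
pre = ColumnTransformer([
    ('num', StandardScaler(), num_cols),                             # scaling lives inside the pipeline only
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])
clf = Pipeline([('pre', pre), ('model', LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
print('PR-AUC:', average_precision_score(y_test, proba))
print('F2 @0.5:', fbeta_score(y_test, (proba >= 0.5).astype(int), beta=2))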