Purpose
This report documents the feature engineering strategy applied after data cleaning and exploratory mapping.
The objectives at this stage were to:
- Translate domain patterns into structured variables
- Preserve meaningful signal discovered during exploration
- Control multicollinearity
- Encode missingness explicitly
- Design two different feature representations aligned with model behavior
The dataset was transformed into:
- A shared core feature layer
- A Linear-optimized dataset
- A Tree-optimized dataset
Each design decision is documented below.
3A — Core Feature Engineering (Shared Across Pipelines)
These features were created before model specialization and exist in both pipelines.
They convert raw structural information into usable signal.
Source: Step 1 Feature Engineering sections 3.1–3.3
1. Claim Composition Features
- injury_share = injury_claim / total_claim_amount
- property_share = property_claim / total_claim_amount
Why this matters
Raw monetary variables are highly correlated (see Step 1 table, p.6–7).
If the total claim is large, the sub-claims are also large; feeding the model both raw amounts creates dominance and redundancy.
By using shares, we shift focus from “how big” to “how structured.”
This is like comparing:
- someone spending €1000 in total, versus
- someone spending 80% of it on one suspicious category.
The second case carries behavioral structure.
Code Illustration
# Assumes total_claim_amount is never zero; if it can be,
# guard the denominator (e.g. replace 0 with NaN) first.
df["injury_share"] = df["injury_claim"] / df["total_claim_amount"]
df["property_share"] = df["property_claim"] / df["total_claim_amount"]
2. Temporal Derivations
days_since_bind
Policy age at time of incident.
df["days_since_bind"] = (
df["incident_date"] - df["policy_bind_date"]
).dt.days
This captures opportunistic early fraud behavior identified during exploration.
A raw date is a timestamp.
Policy age is behavioral timing.
vehicle_age
Derived from auto_year.
df["vehicle_age"] = 2015 - df["auto_year"]  # 2015 = dataset reference year
Manufacturing year was removed to prevent proxy effects (Step 1 audit table, p.4).
Vehicle age carries depreciation and exposure meaning.
3. Cyclical Encoding of Time
Hours wrap around.
23:00 is close to 00:00.
Linear models treat 23 and 0 as far apart unless we encode circularity.
hour_sin / hour_cos
import numpy as np

df["hour_sin"] = np.sin(2 * np.pi * df["incident_hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["incident_hour"] / 24)
Think of this as mapping time onto a clock circle rather than a straight ruler.
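As a quick sanity check (a minimal sketch, not part of the pipeline), the circular encoding does place 23:00 right next to 00:00, even though the raw hour values are 23 apart:

```python
import numpy as np

def clock_xy(hour):
    """Map an hour (0-23) onto the unit circle, as on a clock face."""
    angle = 2 * np.pi * hour / 24
    return np.sin(angle), np.cos(angle)

# Euclidean distance between 23:00 and 00:00 on the circle:
# small, whereas the raw values 23 and 0 differ by 23.
dist = float(np.hypot(*np.subtract(clock_xy(23), clock_xy(0))))
```

The distance comes out around 0.26 on a circle of radius 1, so a linear model now sees late-night hours as neighbors.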
4. Missingness Flags
Missing values were preserved and encoded explicitly (Step 1 cleaning section, p.4–5).
Flags created for:
- collision_type_missing
- property_damage_missing
- police_report_available_missing
- authorities_contacted_missing
df["collision_type_missing"] = (df["collision_type"] == "Missing").astype(int)
Missingness here reflects incomplete documentation patterns correlated with fraud.
In fraud systems, absence of documentation is often information.
5. CSL Parsing
policy_csl values like "250/500" were parsed into:
- csl_per_person
- csl_per_incident
csl = df["policy_csl"].str.split("/", expand=True).astype(int)
df["csl_per_person"] = csl[0]
df["csl_per_incident"] = csl[1]
This converts structured strings into usable numeric exposure variables.
Why Feature Engineering Exists
Before splitting pipelines, it’s important to understand:
Different models “see” the world differently.
Linear models draw straight decision lines.
Tree models carve stepwise decision regions.
You don’t feed both the same shape and expect identical behavior.
It’s like giving:
- A ruler to measure curves
- Or giving a knife that can carve corners
The tool determines how you prepare the material.
Now we split.
3B — Linear Model Feature Engineering
Reshaping reality so a straight-line model can operate
Source: Step 1 Section 3.3 + Linear dataset description
Linear models assume additive relationships.
They cannot:
- Learn automatic interaction terms
- Detect non-linear thresholds
- Handle extreme skew gracefully
So we transform reality into approximated segments.
1. Risk Segmentation Buckets
Continuous variables were discretized:
- claim_amount_risk_band
- customer_tenure_bucket
- umbrella_limit_bin
- policy_annual_premium_bin
- days_since_bind_bin
- days_since_bind_bucket_final
This converts curved risk patterns into step functions.
Example Concept
If fraud spikes under 30 days of tenure, then instead of hoping the model learns a curve, we create explicit tenure buckets.
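A minimal sketch of this bucketing with `pd.cut`. The bin edges below are hypothetical; the real cut points come from the Step 1 exploration:

```python
import pandas as pd

# Hypothetical tenure bin edges, for illustration only.
df = pd.DataFrame({"months_as_customer": [3, 14, 70, 250]})
df["customer_tenure_bucket"] = pd.cut(
    df["months_as_customer"],
    bins=[-1, 12, 60, 120, float("inf")],
    labels=["<1yr", "1-5yr", "5-10yr", "10yr+"],
)
```

Each continuous value now lands in a discrete band, which a linear model can weight independently.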
2. Holiday & Calendar Features
Linear dataset includes:
- incident_dow
- incident_weekend
- policy_bind_year
- policy_bind_month
- policy_bind_quarter
- is_holiday
- holiday_window_2d
These create additive behavioral markers.
Trees can find interactions between month and severity automatically.
Linear models require explicit signals.
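A sketch of the day-of-week and weekend derivations (is_holiday and holiday_window_2d need an external holiday calendar and are omitted here):

```python
import pandas as pd

# Two illustrative dates: 2015-01-03 is a Saturday, 2015-01-05 a Monday.
df = pd.DataFrame({"incident_date": pd.to_datetime(["2015-01-03", "2015-01-05"])})
df["incident_dow"] = df["incident_date"].dt.dayofweek          # Monday = 0
df["incident_weekend"] = (df["incident_dow"] >= 5).astype(int)
df["incident_month"] = df["incident_date"].dt.month
```

The same `.dt` accessor pattern yields policy_bind_year, policy_bind_month, and policy_bind_quarter from policy_bind_date.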
3. Premium Risk Proxy
Feature:
- is_high_risk_premium
Derived from premium threshold logic (Step 1 table p.6).
This compresses non-linear premium behavior into a binary additive factor.
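A minimal sketch of the threshold logic. The 1500 cutoff below is a placeholder; the actual value comes from the Step 1 premium table (p.6):

```python
import pandas as pd

PREMIUM_THRESHOLD = 1500.0  # hypothetical cutoff, for illustration only
df = pd.DataFrame({"policy_annual_premium": [800.0, 1450.0, 2100.0]})
df["is_high_risk_premium"] = (df["policy_annual_premium"] > PREMIUM_THRESHOLD).astype(int)
```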
4. Categorical Encoding Strategy
For linear models:
- One-hot encoding applied
- High-cardinality features excluded:
- auto_model
- insured_hobbies
Reason: one-hot encoding 39 car models produces sparse, unstable columns.
Trees tolerate sparsity; linear coefficients explode.
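A sketch of the encoding step on a low-cardinality column, assuming high-cardinality columns (auto_model, insured_hobbies) were already dropped:

```python
import pandas as pd

df = pd.DataFrame({"incident_severity": ["Minor Damage", "Major Damage", "Total Loss"]})
# drop_first=True removes one redundant dummy to avoid perfect collinearity
encoded = pd.get_dummies(df, columns=["incident_severity"], drop_first=True)
```

Three categories become two indicator columns; with 39 car models it would be 38 mostly-zero columns, which is why those features are excluded here.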
3C — Tree-Optimized Feature Architecture
Preserving structure for interaction discovery
Source: Step 1 Section 3.2
Trees operate differently.
They:
- Split data recursively
- Learn thresholds automatically
- Capture interactions without manual feature crossing
So we preserve richness instead of compressing it.
1. Retaining High-Cardinality Features
Kept for Trees:
- auto_make
- auto_model
- insured_hobbies
Trees can branch on:
“auto_model = X AND severity = Major”
without manual encoding of interactions.
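One simple way to keep such columns cheap for a tree pipeline (a sketch, not necessarily the encoding used here) is integer category codes, since trees split on thresholds and ignore the arbitrary ordering that would mislead a linear model:

```python
import pandas as pd

df = pd.DataFrame({"auto_model": ["Civic", "F150", "Civic", "Corolla"]})
# Codes follow the sorted category order: Civic=0, Corolla=1, F150=2
df["auto_model_code"] = df["auto_model"].astype("category").cat.codes
```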
2. Preserving Continuous Features
No aggressive bucketing applied.
Trees benefit from raw numeric structure.
They internally discover optimal split points.
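A toy illustration of that discovery (invented data, exhaustive 0/1-error search standing in for a real impurity criterion):

```python
import numpy as np

x = np.array([5, 10, 20, 90, 120, 400])   # e.g. days_since_bind
y = np.array([1, 1, 1, 0, 0, 0])          # 1 = fraud

def best_split(x, y):
    """Try every midpoint between sorted values; keep the lowest-error one."""
    xs = np.sort(x)
    thresholds = (xs[:-1] + xs[1:]) / 2
    errors = [float(np.mean((x <= t).astype(int) != y)) for t in thresholds]
    i = int(np.argmin(errors))
    return float(thresholds[i]), errors[i]

threshold, error = best_split(x, y)  # the tree "finds" its own cutoff
```

No manual bucket was supplied, yet the split separates the classes perfectly; this is why the tree dataset keeps raw numeric features.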
3. Hour Buckets (Coarse Behavioral Context)
hour_bin_4:
- Night
- Morning
- Afternoon
- Evening
Coarse grouping supports behavioral clustering while retaining hour_sin/cos in shared layer.
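A sketch of the four-way binning; the 6-hour window boundaries below are an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({"incident_hour": [2, 9, 15, 21]})
# Hypothetical windows: 0-5 Night, 6-11 Morning, 12-17 Afternoon, 18-23 Evening
df["hour_bin_4"] = pd.cut(
    df["incident_hour"],
    bins=[-1, 5, 11, 17, 23],
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
```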
Why Two Representations Matter
Linear Model World:
Decision boundary is a flat surface. If risk curves, you approximate it using steps.
Tree Model World:
Decision boundary is staircase-like. Model adapts naturally to curvature.
Same data. Different geometry. Different preparation.
This is architectural thinking — not just feature counting.
Summary of Engineering Philosophy
Shared Layer:
- Normalize structure
- Encode missingness
- Convert temporal signals
- Remove leakage
Linear Dataset:
- Add buckets
- Reduce dimensional explosion
- Encode interactions explicitly
Tree Dataset:
- Preserve cardinality
- Retain non-linear numeric structure
- Allow interaction discovery
This approach produced two modeling-ready datasets with different geometric assumptions, both grounded in the same cleaned base.