Purpose

This report documents the feature engineering strategy applied after data cleaning and exploratory mapping.

The objectives at this stage were to:

  • Translate domain patterns into structured variables
  • Preserve meaningful signal discovered during exploration
  • Control multicollinearity
  • Encode missingness explicitly
  • Design two different feature representations aligned with model behavior

The dataset was transformed into:

  • A shared core feature layer
  • A Linear-optimized dataset
  • A Tree-optimized dataset

Each design decision is documented below.

. . .

3A — Core Feature Engineering (Shared Across Pipelines)

These features were created before model specialization and exist in both pipelines.

They convert raw structural information into usable signal.

Source: Step 1 Feature Engineering sections 3.1–3.3

. . .

1. Claim Composition Features

injury_share

[
injury_claim / total_claim_amount
]

property_share

[
property_claim / total_claim_amount
]

Why this matters

Raw monetary variables are highly correlated (see Step 1 table, p.6–7).

If total claim is large, sub-claims are also large. Feeding both raw creates dominance and redundancy.

By using shares, we shift focus from “how big” to “how structured.”

This is like comparing:

  • Someone spending €1000 total
    vs
  • Someone spending 80% of it on one suspicious category

The second case carries behavioral structure.

Code Illustration

# Guard the denominator: a zero total would otherwise produce inf shares
totals = df["total_claim_amount"].replace(0, np.nan)
df["injury_share"] = df["injury_claim"] / totals
df["property_share"] = df["property_claim"] / totals
. . .
“Fraud cases show composition differences beyond raw magnitude.”
. . .

2. Temporal Derivations

days_since_bind

Policy age at time of incident.

# both columns must already be parsed as datetimes (e.g. via pd.to_datetime)
df["days_since_bind"] = (
    df["incident_date"] - df["policy_bind_date"]
).dt.days

This captures opportunistic early fraud behavior identified during exploration.

A raw date is a timestamp.
Policy age is behavioral timing.

. . .

vehicle_age

Derived from auto_year.

df["vehicle_age"] = 2015 - df["auto_year"]  # 2015 = the dataset's reference year

Manufacturing year was removed to prevent proxy effects (Step 1 audit table, p.4).

Vehicle age carries depreciation and exposure meaning.

. . .

3. Cyclical Encoding of Time

Hours wrap around.
23:00 is close to 00:00.

Linear models treat 23 and 0 as far apart unless we encode circularity.

hour_sin / hour_cos

df["hour_sin"] = np.sin(2 * np.pi * df["incident_hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["incident_hour"] / 24)

Think of this as mapping time onto a clock circle rather than a straight ruler.
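A quick sketch (using an illustrative `encode_hour` helper, not a pipeline function) makes the point concrete: after encoding, 23:00 and 00:00 sit almost on top of each other, while 12:00 lands on the opposite side of the circle.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw scale: |23 - 0| = 23, the maximum possible gap.
# Circle scale: the two points are nearly adjacent.
dist_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))   # ~0.26
dist_12_0 = np.linalg.norm(encode_hour(12) - encode_hour(0))   # 2.0, opposite sides
```

A linear model consuming (sin, cos) therefore sees midnight-adjacent hours as genuinely close.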

. . .

4. Missingness Flags

Missing values were preserved and encoded explicitly (Step 1 cleaning section, p.4–5).

Flags created for:

  • collision_type_missing
  • property_damage_missing
  • police_report_available_missing
  • authorities_contacted_missing

df["collision_type_missing"] = (df["collision_type"] == "Missing").astype(int)
# the other three flags follow the same pattern

Missingness here reflects incomplete documentation patterns correlated with fraud.

In fraud systems, absence of documentation is often information.

. . .

5. CSL Parsing

policy_csl values like "250/500" were parsed into:

  • csl_per_person
  • csl_per_incident

csl = df["policy_csl"].str.split("/", expand=True).astype(int)
df["csl_per_person"] = csl[0]
df["csl_per_incident"] = csl[1]

This converts structured strings into usable numeric exposure variables.

. . .

Why Feature Engineering Exists

Before splitting pipelines, it’s important to understand:

Different models “see” the world differently.

Linear models draw straight decision lines.
Tree models carve stepwise decision regions.

You don’t feed both the same shape and expect identical behavior.

It’s like choosing between:

  • A ruler that must approximate curves with straight segments
  • A knife that can carve corners directly

The tool determines how you prepare the material.

Now we split.

. . .

3B — Linear Model Feature Engineering

Reshaping reality so a straight-line model can operate

Source: Step 1 Section 3.3 + Linear dataset description

Linear models assume additive relationships.

They cannot:

  • Learn automatic interaction terms
  • Detect non-linear thresholds
  • Handle extreme skew gracefully

So we transform reality into approximated segments.

. . .

1. Risk Segmentation Buckets

Continuous variables were discretized:

  • claim_amount_risk_band
  • customer_tenure_bucket
  • umbrella_limit_bin
  • policy_annual_premium_bin
  • days_since_bind_bin
  • days_since_bind_bucket_final

This converts curved risk patterns into step functions.

Example Concept

If fraud spikes under 30 days of tenure, then:
Instead of hoping the model learns a curve, we create explicit tenure buckets.
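A minimal sketch of how such buckets can be built with `pd.cut`; the column name, cut points, and labels below are illustrative assumptions, not the actual edges used in the pipeline.

```python
import pandas as pd

df = pd.DataFrame({"customer_tenure_days": [5, 45, 200, 900, 4000]})

# Illustrative edges only -- the real cut points come from the Step 1 analysis.
df["customer_tenure_bucket"] = pd.cut(
    df["customer_tenure_days"],
    bins=[-1, 30, 180, 365, 10_000],
    labels=["under_30d", "30_180d", "180_365d", "over_1y"],
)
```

Each bucket then becomes its own step in the linear model's decision surface.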

. . .
“Bucketed tenure stabilizes linear separability.”
. . .

2. Holiday & Calendar Features

Linear dataset includes:

  • incident_dow
  • incident_weekend
  • policy_bind_year
  • policy_bind_month
  • policy_bind_quarter
  • is_holiday
  • holiday_window_2d

These create additive behavioral markers.

Trees can find interactions between month and severity automatically.
Linear models require explicit signals.
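Most of these markers fall out of pandas' `.dt` accessors. The sketch below uses toy dates; the holiday flags are omitted because they additionally require a holiday calendar, which the report does not reproduce here.

```python
import pandas as pd

df = pd.DataFrame({
    "incident_date": pd.to_datetime(["2015-01-03", "2015-01-05"]),
    "policy_bind_date": pd.to_datetime(["2014-06-15", "2013-11-02"]),
})

df["incident_dow"] = df["incident_date"].dt.dayofweek          # Mon=0 .. Sun=6
df["incident_weekend"] = (df["incident_dow"] >= 5).astype(int)
df["policy_bind_year"] = df["policy_bind_date"].dt.year
df["policy_bind_month"] = df["policy_bind_date"].dt.month
df["policy_bind_quarter"] = df["policy_bind_date"].dt.quarter
```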

. . .

3. Premium Risk Proxy

Feature:

  • is_high_risk_premium

Derived from premium threshold logic (Step 1 table p.6).

This compresses non-linear premium behavior into a binary additive factor.
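A hedged sketch of the idea; the `PREMIUM_THRESHOLD` value below is a placeholder, not the threshold derived in the Step 1 table.

```python
import pandas as pd

df = pd.DataFrame({"policy_annual_premium": [820.0, 1406.9, 2047.5]})

# Placeholder cut -- the real threshold comes from the Step 1 premium analysis.
PREMIUM_THRESHOLD = 1500.0
df["is_high_risk_premium"] = (
    df["policy_annual_premium"] > PREMIUM_THRESHOLD
).astype(int)
```

The linear model then receives one clean additive coefficient instead of a curve it cannot represent.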

. . .

4. Categorical Encoding Strategy

For linear models:

  • One-hot encoding applied
  • High-cardinality features excluded:
    • auto_model
    • insured_hobbies

Reason:
One-hot on 39 car models creates sparse instability.

Trees tolerate sparsity.
Linear coefficients explode.
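A minimal sketch of this strategy with `pd.get_dummies` on toy data; the exact column set in the real pipeline may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "incident_severity": ["Major", "Minor", "Major"],
    "auto_model": ["A3", "Civic", "Corolla"],       # high-cardinality in full data
    "insured_hobbies": ["chess", "golf", "chess"],  # high-cardinality in full data
})

HIGH_CARDINALITY = ["auto_model", "insured_hobbies"]

linear_df = pd.get_dummies(
    df.drop(columns=HIGH_CARDINALITY),  # exclude before expansion
    drop_first=True,                    # avoid the dummy-variable trap
)
```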

. . .
“High-cardinality expansion increases dimensional instability for linear models.”
. . .

3C — Tree-Optimized Feature Architecture

Preserving structure for interaction discovery

Source: Step 1 Section 3.2

Trees operate differently.

They:

  • Split data recursively
  • Learn thresholds automatically
  • Capture interactions without manual feature crossing

So we preserve richness instead of compressing it.

. . .

1. Retaining High-Cardinality Features

Kept for Trees:

  • auto_make
  • auto_model
  • insured_hobbies

Trees can branch on:
“auto_model = X AND severity = Major”
without manual encoding of interactions.
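The report does not name the tree-side encoder, but one common tree-friendly choice is plain integer category codes, which keep a single column regardless of cardinality:

```python
import pandas as pd

df = pd.DataFrame({"auto_model": ["A3", "Civic", "A3", "Corolla"]})

# One integer column instead of 39 sparse dummy columns.
df["auto_model_code"] = df["auto_model"].astype("category").cat.codes
```

Trees split on these codes by threshold, so no dimensional explosion occurs.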

. . .

2. Preserving Continuous Features

No aggressive bucketing applied.

Trees benefit from raw numeric structure.

They internally discover optimal split points.

. . .

3. Hour Buckets (Coarse Behavioral Context)

hour_bin_4:

  • Night
  • Morning
  • Afternoon
  • Evening

Coarse grouping supports behavioral clustering while retaining hour_sin/cos in shared layer.
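A sketch of such a bucketing with `pd.cut`; the 6-hour boundaries below are assumed for illustration, not taken from the report.

```python
import pandas as pd

df = pd.DataFrame({"incident_hour": [2, 9, 15, 21]})

# Assumed 6-hour windows -- the pipeline's actual boundaries may differ.
df["hour_bin_4"] = pd.cut(
    df["incident_hour"],
    bins=[-1, 5, 11, 17, 23],
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
```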

. . .
“Tree models learn hierarchical interaction without manual crossing.”
. . .

Why Two Representations Matter

Linear Model World:

Decision boundary is a flat surface. If risk curves, you approximate it using steps.

Tree Model World:

Decision boundary is staircase-like. Model adapts naturally to curvature.

Same data. Different geometry. Different preparation.

This is architectural thinking — not just feature counting.

. . .

Summary of Engineering Philosophy

Shared Layer:

  • Normalize structure
  • Encode missingness
  • Convert temporal signals
  • Remove leakage

Linear Dataset:

  • Add buckets
  • Reduce dimensional explosion
  • Encode interactions explicitly

Tree Dataset:

  • Preserve cardinality
  • Retain non-linear numeric structure
  • Allow interaction discovery

This approach produced two modeling-ready datasets with different geometric assumptions, both grounded in the same cleaned base.