Purpose
This report documents the feature engineering strategy applied after data cleaning and exploratory mapping.
The objectives at this stage were to:
- Translate domain patterns into structured variables
- Preserve meaningful signal discovered during exploration
- Control multicollinearity
- Encode missingness explicitly
- Design two different feature representations aligned with model behavior
The dataset was transformed into:
- A shared core feature layer
- A Linear-optimized dataset
- A Tree-optimized dataset
Each design decision is documented below.
3A — Core Feature Engineering (Shared Across Pipelines)
These features were created before model specialization and exist in both pipelines.
They convert raw structural information into usable signal.
Source: Step 1 Feature Engineering sections 3.1–3.3
1. Claim Composition Features
- injury_share = injury_claim / total_claim_amount
- property_share = property_claim / total_claim_amount
Why this matters
Raw monetary variables are highly correlated (see Step 1 table, p.6–7).
If the total claim is large, the sub-claims are also large; feeding the model both raw amounts creates dominance and redundancy.
By using shares, we shift focus from “how big” to “how structured.”
This is like comparing:
- someone spending €1000 in total, versus
- someone spending 80% of it on one suspicious category.
The second case carries behavioral structure.
Code Illustration
# Assumes total_claim_amount is never zero; if it can be,
# guard the denominator (e.g. replace 0 with NaN) first.
df["injury_share"] = df["injury_claim"] / df["total_claim_amount"]
df["property_share"] = df["property_claim"] / df["total_claim_amount"]
2. Temporal Derivations
days_since_bind
Policy age at time of incident.
df["days_since_bind"] = (
df["incident_date"] - df["policy_bind_date"]
).dt.days
This captures opportunistic early fraud behavior identified during exploration.
A raw date is a timestamp.
Policy age is behavioral timing.
vehicle_age
Derived from auto_year.
df["vehicle_age"] = 2015 - df["auto_year"]  # 2015 = dataset reference year
Manufacturing year was removed to prevent proxy effects (Step 1 audit table, p.4).
Vehicle age carries depreciation and exposure meaning.
3. Cyclical Encoding of Time
Hours wrap around.
23:00 is close to 00:00.
Linear models treat 23 and 0 as far apart unless we encode circularity.
hour_sin / hour_cos
import numpy as np

df["hour_sin"] = np.sin(2 * np.pi * df["incident_hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["incident_hour"] / 24)
Think of this as mapping time onto a clock circle rather than a straight ruler.
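As a quick sanity check (a minimal sketch, not part of the pipeline), the circular encoding does place 23:00 right next to 00:00, even though the raw hour values are 23 apart:

```python
import numpy as np

def clock_xy(hour):
    """Map an hour (0-23) onto the unit circle, as on a clock face."""
    angle = 2 * np.pi * hour / 24
    return np.sin(angle), np.cos(angle)

# Euclidean distance between 23:00 and 00:00 on the circle:
# small, whereas the raw values 23 and 0 differ by 23.
dist = float(np.hypot(*np.subtract(clock_xy(23), clock_xy(0))))
```

The distance comes out around 0.26 on a circle of radius 1, so a linear model now sees late-night hours as neighbors.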
4. Missingness Flags
Missing values were preserved and encoded explicitly (Step 1 cleaning section, p.4–5).
Flags created for:
- collision_type_missing
- property_damage_missing
- police_report_available_missing
- authorities_contacted_missing
df["collision_type_missing"] = (df["collision_type"] == "Missing").astype(int)
Missingness here reflects incomplete documentation patterns correlated with fraud.
In fraud systems, absence of documentation is often information.
5. CSL Parsing
policy_csl values like "250/500" were parsed into:
- csl_per_person
- csl_per_incident
csl = df["policy_csl"].str.split("/", expand=True).astype(int)
df["csl_per_person"] = csl[0]
df["csl_per_incident"] = csl[1]
This converts structured strings into usable numeric exposure variables.
Why Feature Engineering Exists
Before splitting pipelines, it’s important to understand:
Different models “see” the world differently.
Linear models draw straight decision lines.
Tree models carve stepwise decision regions.
You don’t feed both the same shape and expect identical behavior.
It’s like giving:
- A ruler to measure curves
- Or giving a knife that can carve corners
The tool determines how you prepare the material.
Now we split.
3B — Linear Model Feature Engineering
Reshaping reality so a straight-line model can operate
Source: Step 1 Section 3.3 + Linear dataset description
Linear models assume additive relationships.
They cannot:
- Learn automatic interaction terms
- Detect non-linear thresholds
- Handle extreme skew gracefully
So we transform reality into approximated segments.
1. Risk Segmentation Buckets
Continuous variables were discretized:
- claim_amount_risk_band
- customer_tenure_bucket
- umbrella_limit_bin
- policy_annual_premium_bin
- days_since_bind_bin
- days_since_bind_bucket_final
This converts curved risk patterns into step functions.
Example Concept
If fraud spikes under 30 days of tenure, then instead of hoping the model learns a curve, we create explicit tenure buckets.
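A minimal sketch of this bucketing with `pd.cut`. The bin edges below are hypothetical; the real cut points come from the Step 1 exploration:

```python
import pandas as pd

# Hypothetical tenure bin edges, for illustration only.
df = pd.DataFrame({"months_as_customer": [3, 14, 70, 250]})
df["customer_tenure_bucket"] = pd.cut(
    df["months_as_customer"],
    bins=[-1, 12, 60, 120, float("inf")],
    labels=["<1yr", "1-5yr", "5-10yr", "10yr+"],
)
```

Each continuous value now lands in a discrete band, which a linear model can weight independently.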
2. Holiday & Calendar Features
Linear dataset includes:
- incident_dow
- incident_weekend
- policy_bind_year
- policy_bind_month
- policy_bind_quarter
- is_holiday
- holiday_window_2d
These create additive behavioral markers.
Trees can find interactions between month and severity automatically.
Linear models require explicit signals.
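A sketch of the day-of-week and weekend derivations (is_holiday and holiday_window_2d need an external holiday calendar and are omitted here):

```python
import pandas as pd

# Two illustrative dates: 2015-01-03 is a Saturday, 2015-01-05 a Monday.
df = pd.DataFrame({"incident_date": pd.to_datetime(["2015-01-03", "2015-01-05"])})
df["incident_dow"] = df["incident_date"].dt.dayofweek          # Monday = 0
df["incident_weekend"] = (df["incident_dow"] >= 5).astype(int)
df["incident_month"] = df["incident_date"].dt.month
```

The same `.dt` accessor pattern yields policy_bind_year, policy_bind_month, and policy_bind_quarter from policy_bind_date.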
3. Premium Risk Proxy
Feature:
- is_high_risk_premium
Derived from premium threshold logic (Step 1 table p.6).
This compresses non-linear premium behavior into a binary additive factor.
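A minimal sketch of the threshold logic. The 1500 cutoff below is a placeholder; the actual value comes from the Step 1 premium table (p.6):

```python
import pandas as pd

PREMIUM_THRESHOLD = 1500.0  # hypothetical cutoff, for illustration only
df = pd.DataFrame({"policy_annual_premium": [800.0, 1450.0, 2100.0]})
df["is_high_risk_premium"] = (df["policy_annual_premium"] > PREMIUM_THRESHOLD).astype(int)
```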
4. Categorical Encoding Strategy
For linear models:
- One-hot encoding applied
- High-cardinality features excluded:
- auto_model
- insured_hobbies
Reason: one-hot encoding 39 car models produces sparse, unstable columns.
Trees tolerate sparsity; linear coefficients explode.
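A sketch of the encoding step on a low-cardinality column, assuming high-cardinality columns (auto_model, insured_hobbies) were already dropped:

```python
import pandas as pd

df = pd.DataFrame({"incident_severity": ["Minor Damage", "Major Damage", "Total Loss"]})
# drop_first=True removes one redundant dummy to avoid perfect collinearity
encoded = pd.get_dummies(df, columns=["incident_severity"], drop_first=True)
```

Three categories become two indicator columns; with 39 car models it would be 38 mostly-zero columns, which is why those features are excluded here.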
3C — Tree-Optimized Feature Architecture
Preserving structure for interaction discovery
Source: Step 1 Section 3.2
Trees operate differently.
They:
- Split data recursively
- Learn thresholds automatically
- Capture interactions without manual feature crossing
So we preserve richness instead of compressing it.
1. Retaining High-Cardinality Features
Kept for Trees:
- auto_make
- auto_model
- insured_hobbies
Trees can branch on:
“auto_model = X AND severity = Major”
without manual encoding of interactions.
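One simple way to keep such columns cheap for a tree pipeline (a sketch, not necessarily the encoding used here) is integer category codes, since trees split on thresholds and ignore the arbitrary ordering that would mislead a linear model:

```python
import pandas as pd

df = pd.DataFrame({"auto_model": ["Civic", "F150", "Civic", "Corolla"]})
# Codes follow the sorted category order: Civic=0, Corolla=1, F150=2
df["auto_model_code"] = df["auto_model"].astype("category").cat.codes
```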
2. Preserving Continuous Features
No aggressive bucketing applied.
Trees benefit from raw numeric structure.
They internally discover optimal split points.
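A toy illustration of that discovery (invented data, exhaustive 0/1-error search standing in for a real impurity criterion):

```python
import numpy as np

x = np.array([5, 10, 20, 90, 120, 400])   # e.g. days_since_bind
y = np.array([1, 1, 1, 0, 0, 0])          # 1 = fraud

def best_split(x, y):
    """Try every midpoint between sorted values; keep the lowest-error one."""
    xs = np.sort(x)
    thresholds = (xs[:-1] + xs[1:]) / 2
    errors = [float(np.mean((x <= t).astype(int) != y)) for t in thresholds]
    i = int(np.argmin(errors))
    return float(thresholds[i]), errors[i]

threshold, error = best_split(x, y)  # the tree "finds" its own cutoff
```

No manual bucket was supplied, yet the split separates the classes perfectly; this is why the tree dataset keeps raw numeric features.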
3. Hour Buckets (Coarse Behavioral Context)
hour_bin_4:
- Night
- Morning
- Afternoon
- Evening
Coarse grouping supports behavioral clustering while retaining hour_sin/cos in shared layer.
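A sketch of the four-way binning; the 6-hour window boundaries below are an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({"incident_hour": [2, 9, 15, 21]})
# Hypothetical windows: 0-5 Night, 6-11 Morning, 12-17 Afternoon, 18-23 Evening
df["hour_bin_4"] = pd.cut(
    df["incident_hour"],
    bins=[-1, 5, 11, 17, 23],
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
```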
Why Two Representations Matter
Linear Model World:
Decision boundary is a flat surface. If risk curves, you approximate it using steps.
Tree Model World:
Decision boundary is staircase-like. Model adapts naturally to curvature.
Same data. Different geometry. Different preparation.
This is architectural thinking — not just feature counting.
Summary of Engineering Philosophy
Shared Layer:
- Normalize structure
- Encode missingness
- Convert temporal signals
- Remove leakage
Linear Dataset:
- Add buckets
- Reduce dimensional explosion
- Encode interactions explicitly
Tree Dataset:
- Preserve cardinality
- Retain non-linear numeric structure
- Allow interaction discovery
This approach produced two modeling-ready datasets with different geometric assumptions, both grounded in the same cleaned base.