Purpose

This report presents the exploratory analysis conducted on the cleaned insurance claims dataset.

The objective at this stage is to understand how fraud is distributed across the existing features, before any model attempts to optimize, weight, or amplify those signals.

We examine:

  • Distribution behavior of financial variables
  • Fraud concentration across categorical features
  • Structural missingness patterns
  • Temporal patterns
  • Claim composition differences
  • Correlation structure across numeric variables

The guiding question is simple: Where does fraud appear to cluster naturally within the data?

Understanding this landscape is essential. A model can only amplify what is already present.

1. Financial Variables: The Shape of Claims

The primary monetary variables (total_claim_amount, injury_claim, property_claim, vehicle_claim) display strongly right-skewed distributions.

Most claims are concentrated in moderate value ranges. A relatively small number extend into very high amounts.

In practical terms:

  • Typical claims cluster tightly.
  • A minority stretch far to the right.
  • Extreme claims are rare but large.

This distribution shape has two important implications:

  1. Raw magnitude alone cannot reliably define fraud.
  2. Large claim values will dominate statistical relationships unless the amounts are normalized or restructured.

Exploratory correlation inspection confirms high collinearity between total claim amount and its components. This supports the later decision to create share-based features.

Claim amounts exhibit strong right-skew. A small number of high-value claims stretch the distribution.
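
One way to quantify the skew described above is to compare the sample skewness of a claim amount before and after a log transform. The following is a minimal sketch on synthetic values; the numbers are invented for illustration, and the log1p transform is an assumed normalization choice rather than a step the report confirms.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative values only (not taken from the report's dataset).
claims = pd.DataFrame(
    {
        "total_claim_amount": [
            4000, 5200, 5500, 6100, 6300, 7000, 7400, 8100, 52000, 98000
        ]
    }
)

# Positive skewness means a long right tail.
raw_skew = claims["total_claim_amount"].skew()

# log1p compresses the right tail while preserving the ordering of claims.
claims["log_total_claim_amount"] = np.log1p(claims["total_claim_amount"])
log_skew = claims["log_total_claim_amount"].skew()

print(f"skew (raw): {raw_skew:.2f}, skew (log1p): {log_skew:.2f}")
```

A skewness that drops after the transform suggests the tail is being compressed; whether a log scale suits the downstream features remains a modeling decision.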

2. Incident Severity: Clear Structural Separation

When fraud rate is examined across incident_severity categories, a clear pattern emerges.

Higher-severity categories — particularly Major Damage and Total Loss — show elevated fraud proportions relative to minor incidents.

This is one of the strongest categorical differentiators observed during exploration.

Why this matters:

Severity reflects both financial exposure and narrative plausibility. Severe accidents justify large claims. That creates space where exaggeration or fabrication may be harder to distinguish from legitimate damage.

Severity therefore represents a structurally meaningful axis of separation.

Fraud proportion increases noticeably in higher-severity categories.
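
The comparison above amounts to a grouped mean of the fraud label. A minimal sketch follows, assuming a 0/1 target column named is_fraud (a hypothetical name; the report does not state the label column) and synthetic rows. The "Minor Damage" label is also an assumption.

```python
import pandas as pd

# Synthetic rows for illustration; `is_fraud` is a hypothetical 0/1 target.
claims = pd.DataFrame(
    {
        "incident_severity": [
            "Minor Damage", "Minor Damage", "Minor Damage", "Minor Damage",
            "Major Damage", "Major Damage", "Major Damage",
            "Total Loss", "Total Loss",
        ],
        "is_fraud": [0, 0, 0, 1, 1, 1, 0, 1, 0],
    }
)

# Claim count and fraud proportion per severity level, highest rate first.
severity_summary = (
    claims.groupby("incident_severity")["is_fraud"]
    .agg(claims="size", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)
print(severity_summary)
```

Reporting the claim count next to the rate helps flag categories where the proportion rests on very few rows.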

3. Policy Tenure: Early Claims Concentration

Using the derived variable days_since_bind, fraud distribution was analyzed across policy age buckets.

The pattern shows:

  • Higher fraud concentration in early tenure periods
  • Gradual stabilization as policy age increases

Short-tenure claims appear disproportionately represented among fraud cases.

This is consistent with a behavioral hypothesis: Policies activated shortly before a claim may reflect opportunistic timing. The effect is not extreme, but it is directional and consistent.

Fraud incidence appears elevated in early policy periods.
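
The bucketed view described above can be sketched with pd.cut. The bucket edges and labels below are illustrative assumptions (the report does not state which cut points were used), as are the synthetic rows and the hypothetical is_fraud column.

```python
import pandas as pd

# Synthetic rows; `is_fraud` is a hypothetical 0/1 target column.
claims = pd.DataFrame(
    {
        "days_since_bind": [20, 75, 150, 200, 300, 500, 900, 1500, 2500, 4000],
        "is_fraud": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
    }
)

# Assumed bucket edges in days; right=False makes intervals [lower, upper).
bins = [0, 180, 365, 1825, float("inf")]
labels = ["<6 months", "6-12 months", "1-5 years", "5+ years"]
claims["tenure_bucket"] = pd.cut(
    claims["days_since_bind"], bins=bins, labels=labels, right=False
)

# observed=False keeps every bucket in the output, even if it is empty.
tenure_summary = claims.groupby("tenure_bucket", observed=False)["is_fraud"].agg(
    claims="size", fraud_rate="mean"
)
print(tenure_summary)
```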

4. Missing Documentation: Absence as Information

Several categorical variables contained meaningful missing values:

  • collision_type
  • police_report_available
  • property_damage
  • authorities_contacted

An exploratory comparison of present versus missing categories shows that claims with a missing collision_type have a higher fraud proportion.

Missing documentation in insurance contexts can reflect:

  • Incomplete reporting
  • Delayed documentation
  • Ambiguity in event description

This is not interpreted as causal. However, the concentration pattern is clear enough to justify preserving missingness as an explicit signal rather than discarding or imputing it.

Claims with missing collision type show elevated fraud proportion.
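
Preserving missingness as a signal can be as simple as adding an indicator column and an explicit "Missing" category. The sketch below uses synthetic rows; is_fraud, the collision labels, and the assumption that missing values arrive as NaN are all illustrative, since the report does not specify how missingness is encoded.

```python
import numpy as np
import pandas as pd

# Synthetic rows; category labels and `is_fraud` are placeholders.
claims = pd.DataFrame(
    {
        "collision_type": [
            "Rear", np.nan, "Side", np.nan, "Front", np.nan, "Rear", "Side"
        ],
        "is_fraud": [0, 1, 0, 1, 1, 0, 0, 0],
    }
)

# Flag missingness first, then keep it as its own category instead of imputing.
claims["collision_type_missing"] = claims["collision_type"].isna().astype(int)
claims["collision_type"] = claims["collision_type"].fillna("Missing")

# Fraud proportion for present (0) versus missing (1) collision type.
missing_summary = claims.groupby("collision_type_missing")["is_fraud"].mean()
print(missing_summary)
```

The same pattern could be applied to the other columns listed above, one indicator per column.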

5. Demographics: Broad Distribution, Limited Separation

Exploration of demographic variables:

  • insured_sex
  • insured_education_level
  • insured_occupation

...reveals limited separation between fraud and non-fraud cases.

Fraud appears broadly distributed across categories without strong concentration in a single demographic group.

This suggests:

  • Fraud patterns in this dataset are more behavioral or structural than demographic.
  • Demographic features alone are unlikely to be dominant drivers.

Fraud distribution across gender categories shows minimal separation.
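
A row-normalized cross-tabulation is one simple way to check for this kind of separation. The sketch below uses synthetic rows with placeholder category codes and a hypothetical is_fraud column; it only illustrates the mechanics, not the report's actual figures.

```python
import pandas as pd

# Synthetic rows; "F"/"M" codes and `is_fraud` are placeholders.
claims = pd.DataFrame(
    {
        "insured_sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
        "is_fraud": [0, 0, 1, 1, 0, 0, 0, 0],
    }
)

# normalize="index" turns each row into proportions that sum to 1.
sex_table = pd.crosstab(claims["insured_sex"], claims["is_fraud"], normalize="index")

# Gap between the highest and lowest fraud proportion across categories.
rate_spread = sex_table[1].max() - sex_table[1].min()
print(sex_table)
print(f"spread in fraud rate: {rate_spread:.2f}")
```

A small spread relative to the overall fraud rate is consistent with limited separation; the same table can be built for insured_education_level and insured_occupation.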

6. Claim Composition: Internal Allocation Patterns

Examining the ratio between sub-claims and total claim amount reveals additional structure.

By analyzing:

  • injury_claim / total_claim_amount
  • property_claim / total_claim_amount

...differences appear between fraud and non-fraud groups.

This suggests that not only the total size but also the allocation pattern may carry signal.

For example: Two claims may have identical totals, but differ in how that total is distributed across injury, property, and vehicle damage. This observation motivated the later introduction of share-based features.

Claim composition differences emerge when normalized by total claim amount.
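
The ratios above can be sketched as follows. The rows are synthetic, the *_share column names are hypothetical, and the toy data assumes the total equals the sum of the three components. The first two rows mirror the example of identical totals with different allocations.

```python
import numpy as np
import pandas as pd

components = ["injury_claim", "property_claim", "vehicle_claim"]

# Synthetic rows: the first two share a total but split it differently.
claims = pd.DataFrame(
    {
        "injury_claim": [1000, 5000, 0],
        "property_claim": [2000, 1000, 0],
        "vehicle_claim": [7000, 4000, 0],
    }
)
# Assumption for this toy data: total is the sum of its components.
claims["total_claim_amount"] = claims[components].sum(axis=1)

# Replace zero totals with NaN so the division yields NaN instead of inf.
total = claims["total_claim_amount"].replace(0, np.nan)
for col in components:
    claims[col.replace("_claim", "_share")] = claims[col] / total

print(claims[["total_claim_amount", "injury_share", "property_share", "vehicle_share"]])
```

How zero-total claims should be handled (NaN, zero shares, or exclusion) is left open here and depends on the later modeling choices.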

7. Correlation Structure: Monetary Redundancy

Correlation analysis among numeric variables shows:

  • Strong correlation between total claim and its components
  • Redundant financial structure
  • High multicollinearity risk if all raw amounts are used together

This structural redundancy reinforces the need for normalized features and careful feature selection in later stages.

High correlation among financial variables indicates redundancy.
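
A redundancy check of this kind can be sketched by computing a correlation matrix and listing the pairs above a threshold. Everything below is illustrative: the data is randomly generated, the 0.8 cutoff is an arbitrary assumption, and Spearman rank correlation is used here because it is less sensitive to the right skew noted earlier (the report does not say which coefficient was used).

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed components (illustration only).
rng = np.random.default_rng(0)
n = 200
amounts = pd.DataFrame(
    {
        "injury_claim": rng.lognormal(8.0, 0.6, n),
        "property_claim": rng.lognormal(8.0, 0.6, n),
        "vehicle_claim": rng.lognormal(9.5, 0.6, n),
    }
)
amounts["total_claim_amount"] = amounts.sum(axis=1)

corr = amounts.corr(method="spearman")

# List each unordered pair of columns whose |correlation| exceeds the cutoff.
threshold = 0.8  # assumed cutoff, not taken from the report
cols = list(corr.columns)
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```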

8. Temporal Patterns: Weak Direct Separation

Hour-of-day and weekday exploration shows limited direct fraud separation.

Fraud does not concentrate sharply around a single hour or weekday cluster.

This indicates:

  • Raw hour is not a dominant standalone predictor.
  • If temporal effects exist, they may require structured representation rather than simple categorical grouping.

Hour-of-day shows relatively flat fraud distribution.
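
One possible "structured representation" of hour-of-day is a cyclical sine/cosine encoding, which places 23:00 next to 00:00 instead of at opposite ends of a scale. This is offered only as an example of the idea; the report does not state which representation, if any, was adopted, and the incident_hour column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hours (0-23); `incident_hour` is a hypothetical column name.
claims = pd.DataFrame({"incident_hour": [0, 6, 12, 18, 23]})

# Map each hour to an angle on a 24-hour circle.
angle = 2 * np.pi * claims["incident_hour"] / 24
claims["hour_sin"] = np.sin(angle)
claims["hour_cos"] = np.cos(angle)

print(claims.round(3))
```

In this encoding the points for hours 23 and 0 lie close together on the unit circle, while hours 0 and 12 lie at opposite sides.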

9. Consolidated Observations

Fraud concentration appears strongest along:

  • Incident severity
  • Early policy tenure
  • Missing documentation
  • Claim composition patterns

Fraud appears broadly distributed across:

  • Demographic categories
  • Time-of-day

Monetary variables require normalization due to skew and multicollinearity.

These findings define the structural landscape upon which modeling will operate.

Key Takeaway

The exploratory analysis reveals that fraud in this dataset is most visibly associated with:

  1. Structural event severity
  2. Timing relative to policy initiation
  3. Documentation completeness
  4. Internal allocation of claim components

It is less visibly associated with demographic variables or isolated temporal markers.

This mapping of signal concentration provides a grounded foundation for feature engineering and model design in subsequent stages.