Overfitting Analysis

What We Mean by Overfitting

A model overfits when it performs very well on the training set but noticeably worse on validation or unseen cases. That usually means it has memorized case-specific or window-specific noise instead of learning general “leak vs non-leak” patterns.

How We Detect It

Train vs validation metrics: Compare accuracy, precision, recall, F1, and AUC on the training set and on the validation set. If the training scores are much higher than the validation scores, we suspect overfitting.
By-case split: Because we split by case (see Train/validation split by case), validation metrics reflect performance on unseen cases, which is a good proxy for production.
Learning curves (optional): Plot metrics against training set size or against epoch/iteration; if validation performance plateaus or worsens while training performance keeps improving, that suggests overfitting.
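
The train vs validation comparison can be sketched in a few lines of plain Python. This is only an illustration, not the platform's evaluation code: the helper names `binary_metrics` and `overfitting_gaps`, the 0/1 label encoding (1 = leak), and the `max_gap=0.10` threshold are all assumptions made for the example, and AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (assumed: 1 = leak)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def overfitting_gaps(train_metrics, val_metrics, max_gap=0.10):
    """Return the metrics whose train-minus-validation gap exceeds max_gap.

    max_gap is a hypothetical threshold; choose one that fits your data.
    """
    gaps = {name: train_metrics[name] - val_metrics[name]
            for name in train_metrics}
    return {name: gap for name, gap in gaps.items() if gap > max_gap}


# Toy data: perfect on the training set, weaker on the validation set.
train = binary_metrics([1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0])
val = binary_metrics([1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0])
flagged = overfitting_gaps(train, val)
print(flagged)
```

Any metric that appears in `flagged` is a candidate sign of overfitting. A small gap is normal, so the threshold is a judgment call rather than a fixed rule.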

Common Causes in This Platform

Too many features for the number of cases → use feature selection and a smaller schema.
Too complex a model (e.g. too many leaves or too much depth in LightGBM) → tune with Optuna and use early stopping.
Data leakage → avoided by splitting by case_id and by not using future or “test” information in features.
Label or preprocessing mismatch → ensure the same schema and preprocessing are used in training and validation.
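
To make the by-case split concrete, here is a minimal sketch that assigns whole cases to either the training or the validation set, so no case contributes rows to both. It is an illustration under assumptions, not the platform's splitting code: the function name `split_by_case`, the row layout (dicts with a `case_id` key and a hypothetical `window` field), and the 20% validation fraction are made up for the example.

```python
import random


def split_by_case(rows, val_fraction=0.2, seed=0):
    """Split rows so every case_id lands entirely in train or in validation."""
    case_ids = sorted({row["case_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(case_ids)
    n_val = max(1, round(len(case_ids) * val_fraction))
    val_case_ids = set(case_ids[:n_val])
    train = [row for row in rows if row["case_id"] not in val_case_ids]
    val = [row for row in rows if row["case_id"] in val_case_ids]
    return train, val


# Toy data: 10 cases with 4 windows each (the "window" field is hypothetical).
rows = [{"case_id": f"case_{i // 4}", "window": i % 4} for i in range(40)]
train_rows, val_rows = split_by_case(rows)

train_cases = {row["case_id"] for row in train_rows}
val_cases = {row["case_id"] for row in val_rows}
# The key leakage check: no case appears on both sides of the split.
assert train_cases.isdisjoint(val_cases)
```

Splitting on rows instead of cases would let windows from the same case land on both sides, which tends to inflate validation metrics and hide overfitting.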

What We Do About It

Regularization: LightGBM parameters (e.g. min_data_in_leaf, reg_alpha, reg_lambda) are tuned via hyperparameter optimization.
Feature selection: Reduce the number of features so the model has less capacity to memorize.
Early stopping: Stop training when the validation metric stops improving.
Robustness options: Some training configs support “robustness” (e.g. noise or quantization) to make the model less sensitive to small input changes.
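
The early-stopping rule can be illustrated with a small patience-based sketch. LightGBM provides its own early-stopping callback, so this is not how the platform implements it; the function name `best_iteration`, the `patience` value, and the validation AUC history are assumptions chosen only to show the logic.

```python
def best_iteration(val_scores, patience=3):
    """Return (best_index, stopped_index) for a higher-is-better metric.

    Training stops once `patience` iterations in a row fail to beat the best
    validation score seen so far. Ties do not count as an improvement.
    """
    best_score = float("-inf")
    best_index = -1
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_index = score, i
        elif i - best_index >= patience:
            return best_index, i  # stop here, keep the model from best_index
    return best_index, len(val_scores) - 1  # never triggered: ran to the end


# Hypothetical validation AUC per iteration: improves, then drifts down.
val_auc = [0.70, 0.78, 0.82, 0.84, 0.83, 0.84, 0.83, 0.82, 0.81]
best, stopped = best_iteration(val_auc, patience=3)
print(best, stopped)  # prints "3 6"
```

Here the best validation score is reached at iteration 3, and training stops at iteration 6 after three iterations without improvement. The later iterations, where the model would mostly be fitting training noise, are never run.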

For concrete steps to reduce overfitting once detected, see Reducing overfitting.