Overfitting Analysis
What We Mean by Overfitting
A model overfits when it performs very well on the training set but noticeably worse on validation or unseen cases. That usually means it has memorized case-specific or window-specific noise instead of learning general “leak vs non-leak” patterns.
How We Detect It
- Train vs validation metrics: Compare accuracy, precision, recall, F1, and AUC on the training set against the same metrics on the validation set. If the training scores are much higher than the validation scores, we suspect overfitting.
- By-case split: Because we split by case (see Train/validation split by case), validation metrics reflect performance on unseen cases, which is a good proxy for production.
- Learning curves (optional): Plot metrics vs training set size or vs epoch/iteration; if validation performance plateaus or worsens while training performance keeps improving, that suggests overfitting.
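The checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not the platform's implementation: the function names, the `max_gap` threshold of 0.05, the `patience` of 3, and all the numbers are assumptions chosen for the example.

```python
from typing import Dict, List, Optional, Sequence


def metric_gaps(train: Dict[str, float], valid: Dict[str, float]) -> Dict[str, float]:
    """Train-minus-validation gap for every metric reported on both sets."""
    return {name: train[name] - valid[name] for name in train if name in valid}


def flag_overfitting(train: Dict[str, float], valid: Dict[str, float],
                     max_gap: float = 0.05) -> List[str]:
    """Names of metrics whose gap exceeds max_gap (an illustrative threshold)."""
    gaps = metric_gaps(train, valid)
    return sorted(name for name, gap in gaps.items() if gap > max_gap)


def divergence_point(train_curve: Sequence[float], valid_curve: Sequence[float],
                     patience: int = 3) -> Optional[int]:
    """Index of the best validation score if validation then failed to improve
    for `patience` steps while the training score kept rising; otherwise None."""
    best = 0
    for i in range(1, min(len(train_curve), len(valid_curve))):
        if valid_curve[i] > valid_curve[best]:
            best = i
        elif i - best >= patience and train_curve[i] > train_curve[best]:
            return best
    return None


# Made-up numbers, for illustration only.
train_metrics = {"f1": 0.98, "auc": 0.99, "recall": 0.90}
valid_metrics = {"f1": 0.81, "auc": 0.97, "recall": 0.88}
suspect = flag_overfitting(train_metrics, valid_metrics)

train_curve = [0.70, 0.80, 0.86, 0.90, 0.93, 0.95]
valid_curve = [0.68, 0.75, 0.78, 0.77, 0.76, 0.75]
turning_point = divergence_point(train_curve, valid_curve)
```

With these example values only F1 is flagged, and the learning curve turns at index 2, after which validation declines while training keeps improving. A sensible gap threshold depends on the metric and on how many validation cases there are, so treat 0.05 as a placeholder.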
Common Causes in This Platform
- Too many features for the number of cases → use feature selection and a smaller schema.
- Too complex a model (e.g. too many leaves or too much depth in LightGBM) → tune with Optuna and use early stopping.
- Data leakage → avoided by splitting by case_id and by not using future or “test” information in features.
- Label or preprocessing mismatch → ensure the same schema and preprocessing are used in training and validation.
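To illustrate the leakage point, here is a minimal sketch of a split that keeps every row of a case on one side only. It is not the platform's splitter: `split_by_case`, `valid_fraction`, `seed`, and the sample case IDs are hypothetical, and the only thing taken from the text is that rows are grouped by `case_id`.

```python
import random
from typing import List, Sequence, Tuple


def split_by_case(case_ids: Sequence[str], valid_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_idx, valid_idx) row indices so that no case_id
    appears on both sides of the split."""
    unique_cases = sorted(set(case_ids))
    if len(unique_cases) < 2:
        raise ValueError("need at least two distinct cases to split by case")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique_cases)
    # Hold out at least one case, but never all of them.
    n_valid = min(max(1, round(len(unique_cases) * valid_fraction)),
                  len(unique_cases) - 1)
    valid_cases = set(unique_cases[:n_valid])
    train_idx = [i for i, c in enumerate(case_ids) if c not in valid_cases]
    valid_idx = [i for i, c in enumerate(case_ids) if c in valid_cases]
    return train_idx, valid_idx


# One case_id per row (e.g. per window); made-up values for illustration.
case_ids = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
train_idx, valid_idx = split_by_case(case_ids)

train_cases = {case_ids[i] for i in train_idx}
valid_cases = {case_ids[i] for i in valid_idx}
```

Because the shuffle operates on case IDs rather than rows, windows from the same case can never land in both training and validation, which is what a row-level random split would allow.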
What We Do About It
- Regularization: LightGBM parameters (e.g. min_data_in_leaf, reg_alpha, reg_lambda) are tuned via hyperparameter optimization.
- Feature selection: Reduce the number of features so the model has less capacity to memorize.
- Early stopping: Stop training when the validation metric stops improving.
- Robustness options: Some training configs support “robustness” (e.g. noise or quantization) to make the model less sensitive to small input changes.
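The early-stopping rule from the list above can be sketched independently of any library. This is an illustration of the rule only: `best_iteration`, `patience`, `min_delta`, and the scores are assumptions made for the example, and in practice the training library's own early-stopping mechanism would normally be used instead.

```python
from typing import Iterable, Tuple


def best_iteration(valid_scores: Iterable[float], patience: int = 3,
                   min_delta: float = 0.0) -> Tuple[int, float]:
    """Consume per-iteration validation scores (higher is better) and stop
    once `patience` iterations pass without an improvement larger than
    `min_delta`. Returns (best_iteration, best_score)."""
    best_iter, best_score = -1, float("-inf")
    for i, score in enumerate(valid_scores):
        if score > best_score + min_delta:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            break  # validation has stalled; later scores are never read
    return best_iter, best_score


# Made-up validation scores; the final 0.90 is never reached.
scores = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.90]
stopped_at = best_iteration(scores)
```

Here training would stop after iteration 5 and keep the model from iteration 2. A larger `patience` tolerates noisier validation curves at the cost of extra training time.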
For concrete steps to reduce overfitting once detected, see Reducing overfitting.