Overfitting Analysis

What We Mean by Overfitting

A model overfits when it performs very well on the training set but noticeably worse on validation or unseen cases. That usually means it has memorized case-specific or window-specific noise instead of learning general “leak vs non-leak” patterns.


How We Detect It

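  • Train vs validation metrics: Compare accuracy, precision, recall, F1, and AUC on the training set against the same metrics on the validation set. If the training scores are consistently and substantially higher than the validation scores, we suspect overfitting.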
  • By-case split: Because we split by case (see Train/validation split by case), validation metrics reflect performance on unseen cases, which is a good proxy for production.
  • Learning curves (optional): Plot metrics vs training set size or vs epoch/iteration; if validation performance plateaus or worsens while train keeps improving, that suggests overfitting.
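
The train vs validation comparison above can be sketched as a small helper. This is a minimal illustration, not the platform's actual check: the function name `overfitting_gaps`, the `max_gap` threshold of 0.05, and the metric values are all assumptions chosen for the example, and a sensible threshold depends on the metric and the amount of data.

```python
def overfitting_gaps(train_metrics, val_metrics, max_gap=0.05):
    """Return {metric: gap} for metrics where train exceeds validation by more than max_gap.

    Both arguments are dicts of higher-is-better scores, e.g. {"f1": 0.9}.
    max_gap is a hypothetical tolerance, not a value taken from the platform.
    """
    flagged = {}
    for name, train_value in train_metrics.items():
        if name not in val_metrics:
            continue  # only compare metrics reported on both splits
        gap = train_value - val_metrics[name]
        if gap > max_gap:
            flagged[name] = round(gap, 4)
    return flagged


# Illustrative numbers only, not results from this platform.
train = {"accuracy": 0.99, "f1": 0.98, "auc": 0.995}
val = {"accuracy": 0.91, "f1": 0.84, "auc": 0.96}
flagged = overfitting_gaps(train, val)
print(flagged)
```

With these example numbers, accuracy and F1 exceed the tolerance while AUC does not, so only the first two would be flagged for a closer look.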

Common Causes in This Platform

  • Too many features for the number of cases → use feature selection and a smaller schema.
  • Too complex a model (e.g. too many leaves or too much depth in LightGBM) → tune with Optuna and use early stopping.
  • Data leakage → avoided by splitting by case_id and by not using future or “test” information in features.
  • Label or preprocessing mismatch → ensure the same schema and preprocessing are used in training and validation.
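
To illustrate the leakage point above, here is a minimal sketch of a split that keeps every window of a case on one side only. It assumes each row is a dict carrying a `case_id` key; the function name `split_by_case`, the `val_fraction` default, and the synthetic rows are hypothetical and not taken from the platform's code.

```python
import random


def split_by_case(rows, val_fraction=0.2, seed=0):
    """Split rows into train/validation so that no case_id appears in both."""
    case_ids = sorted({row["case_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(case_ids)
    n_val = max(1, round(len(case_ids) * val_fraction))
    val_cases = set(case_ids[:n_val])
    train_rows = [r for r in rows if r["case_id"] not in val_cases]
    val_rows = [r for r in rows if r["case_id"] in val_cases]
    return train_rows, val_rows


# Synthetic data: 10 cases with 4 windows each (illustrative only).
rows = [{"case_id": f"case_{i // 4}", "window": i % 4} for i in range(40)]
train_rows, val_rows = split_by_case(rows)
train_cases = {r["case_id"] for r in train_rows}
val_cases = {r["case_id"] for r in val_rows}
print(len(train_cases), len(val_cases))
```

Splitting on the set of case IDs rather than on individual rows is what prevents windows from the same case landing in both sets. If scikit-learn is available, `GroupShuffleSplit` offers a comparable group-aware split.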

What We Do About It

  • Regularization: LightGBM parameters (e.g. min_data_in_leaf, reg_alpha, reg_lambda) are tuned via hyperparameter optimization.
  • Feature selection: Reduce the number of features so the model has less capacity to memorize.
  • Early stopping: Stop training when the validation metric stops improving.
  • Robustness options: Some training configs support “robustness” (e.g. noise or quantization) to make the model less sensitive to small input changes.
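
The early stopping rule in the list above can be sketched with a simple patience counter. This is an illustration of the idea under stated assumptions, not the platform's training loop: the function name, the `patience` and `min_delta` parameters, and the validation scores are made up for the example, and the metric is assumed to be higher-is-better.

```python
def best_iteration_with_patience(val_scores, patience=3, min_delta=0.0):
    """Return (best_index, stop_index) for a higher-is-better validation metric.

    Training is considered stopped once `patience` consecutive iterations
    have passed without beating the best score by more than min_delta.
    """
    best_score = float("-inf")
    best_index = -1
    for i, score in enumerate(val_scores):
        if score > best_score + min_delta:
            best_score = score
            best_index = i
        elif i - best_index >= patience:
            return best_index, i  # stopped early at iteration i
    return best_index, len(val_scores) - 1  # never triggered; ran to the end


# Illustrative validation AUC per iteration, not real results.
val_auc = [0.80, 0.85, 0.88, 0.89, 0.888, 0.887, 0.886, 0.885]
best, stopped = best_iteration_with_patience(val_auc, patience=3)
print(best, stopped)
```

In practice LightGBM handles this itself, for example through its `lightgbm.early_stopping` callback with a `stopping_rounds` argument, so a hand-written loop like this is mainly useful for understanding the behavior.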

For concrete steps to reduce overfitting once detected, see Reducing overfitting.