Reducing Overfitting
Practical Levers
Once you have detected overfitting (e.g. a large gap between train and validation metrics; see Overfitting analysis), you can pull the following levers.
1. Simplify the Model
- LightGBM: Increase regularization (`reg_alpha`, `reg_lambda`), increase `min_data_in_leaf`, or reduce `num_leaves` / `max_depth`.
- Use hyperparameter optimization (Optuna) to search these parameters. The objective is the validation metric (or cross-validation over cases), so the search favors configurations that generalize rather than ones that merely fit the training data.
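A minimal sketch of such a search space, assuming an Optuna-style `trial` object. The ranges are illustrative guesses, not the platform's defaults, and `train_and_score` is a hypothetical callable that trains on the training cases and returns the validation metric.

```python
def suggest_lgbm_params(trial):
    """Sample a capacity-limiting LightGBM configuration from an Optuna-style trial."""
    return {
        # L1 / L2 regularization strength; log scale because useful
        # values span several orders of magnitude (ranges are assumptions).
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        # Larger leaves and fewer, shallower splits mean a simpler model.
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 200),
        "num_leaves": trial.suggest_int("num_leaves", 8, 64),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
    }


def make_objective(train_and_score):
    """Wrap a hypothetical train_and_score(params) -> validation metric."""
    def objective(trial):
        params = suggest_lgbm_params(trial)
        # The returned value must be computed on validation cases only.
        return train_and_score(params)
    return objective


# Usage with Optuna (not executed here):
#   import optuna
#   study = optuna.create_study(direction="maximize")
#   study.optimize(make_objective(train_and_score), n_trials=50)
```

Keeping the search space in its own function makes it easy to narrow or widen the ranges without touching the training code.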
2. Reduce Features
- Run feature selection (feature selection pipeline) and keep only the top-K features by importance.
- Retrain with the reduced schema; fewer features reduce the model’s capacity to memorize.
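The top-K step can be sketched as follows. This is an illustration rather than the feature selection pipeline itself; the feature names and importance values are made up.

```python
def select_top_k(importances, k):
    """Return the k feature names with the highest importance, best first.

    `importances` maps feature name -> importance score. Ties are broken
    alphabetically so the result is deterministic.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical importances, e.g. normalized gain from a trained model.
example = {
    "pressure_mean": 0.41,
    "pressure_slope": 0.30,
    "flow_std": 0.27,
    "temp_max": 0.02,
}
top_features = select_top_k(example, k=2)
# top_features == ["pressure_mean", "pressure_slope"]
```

With LightGBM's scikit-learn wrapper, the scores would typically come from the fitted model's `feature_importances_` attribute. Computing them on the training cases only keeps the validation metric an honest estimate.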
3. More and More Diverse Data
- Add more cases (more runs, more leak sizes/locations) so the model sees a broader distribution.
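- Split train and validation sets by case, so that “more data” means more cases, not more windows from the same few cases. Windows from one case are highly correlated, so a case that appears on both sides of the split inflates the validation metric.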
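A by-case split can be sketched with the standard library alone. The function name and the 20% validation fraction are assumptions for illustration; `case_ids` holds one entry per window.

```python
import random


def split_by_case(case_ids, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) such that no case appears in both."""
    unique_cases = sorted(set(case_ids))
    if len(unique_cases) < 2:
        raise ValueError("need at least two cases to split")
    rng = random.Random(seed)
    rng.shuffle(unique_cases)
    # Hold out whole cases, keeping at least one case on each side.
    n_val = max(1, round(len(unique_cases) * val_fraction))
    n_val = min(n_val, len(unique_cases) - 1)
    val_cases = set(unique_cases[:n_val])
    train_idx = [i for i, c in enumerate(case_ids) if c not in val_cases]
    val_idx = [i for i, c in enumerate(case_ids) if c in val_cases]
    return train_idx, val_idx


# Ten windows drawn from five cases of different lengths.
case_ids = ["a"] * 3 + ["b"] * 2 + ["c"] * 3 + ["d"] + ["e"]
train_idx, val_idx = split_by_case(case_ids)
```

If scikit-learn is available, `GroupShuffleSplit` or `GroupKFold` with `groups=case_ids` gives the same guarantee and also supports cross-validation over cases.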
4. Early Stopping and Validation
- Use early stopping on the validation metric so training stops when validation stops improving.
- Report and monitor both train and validation metrics in the training scripts and in model validation.
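The stopping rule itself is simple. The sketch below is library-independent and assumes a loss-like metric (lower is better) and a patience of a few rounds; both are illustrative choices, not the training scripts' actual settings.

```python
def early_stop_iteration(val_losses, patience):
    """Return (best_iter, stop_iter) for a patience-based stopping rule.

    Training stops at the first iteration that is `patience` rounds past
    the best validation loss seen so far.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(val_losses) - 1


def train_val_gap(train_losses, val_losses, iteration):
    """Validation loss minus train loss at one iteration; large means overfit."""
    return val_losses[iteration] - train_losses[iteration]


train_losses = [0.80, 0.55, 0.35, 0.25, 0.18, 0.12, 0.08]
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
best_iter, stop_iter = early_stop_iteration(val_losses, patience=3)
# best_iter == 2, stop_iter == 5
gap = train_val_gap(train_losses, val_losses, best_iter)
```

In the example, the later improvement at iteration 6 is never reached, which is the usual trade-off when choosing the patience. In LightGBM the same rule is available as the `lightgbm.early_stopping(stopping_rounds=...)` callback, passed through `callbacks=` together with a validation set.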
5. Robustness (Optional)
- If the training config supports robustness (e.g. adding small noise or quantizing inputs), enable it so the model is trained to be less sensitive to tiny input changes.
- This can improve generalization without changing the architecture.
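The two perturbations mentioned above can be sketched as follows. The function names, `sigma`, and `step` are hypothetical and do not reflect the training config's actual options. The sketch uses plain lists; real pipelines would usually vectorize this with NumPy.

```python
import random


def add_gaussian_noise(rows, sigma, seed=0):
    """Return a copy of `rows` with zero-mean Gaussian noise added to each value."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]


def quantize(rows, step):
    """Return a copy of `rows` with each value snapped to the nearest multiple of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [[round(x / step) * step for x in row] for row in rows]


features = [[0.12, 0.26], [1.04, 0.98]]
noisy = add_gaussian_noise(features, sigma=0.01, seed=42)
coarse = quantize(features, step=0.1)
```

A reasonable assumption is to perturb training inputs only and leave validation inputs untouched, so that validation metrics stay comparable across runs.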
Summary
| Lever | Action |
|---|---|
| Regularization | Tune LightGBM reg_*, min_data_in_leaf, num_leaves via Optuna. |
| Features | Use feature selection; train with fewer, stronger features. |
| Data | More cases; split by case_id. |
| Stopping | Early stopping on validation metric. |
| Robustness | Enable if available in training config. |
Using by-case splits and validation metrics consistently across the platform makes overfitting easier to measure and then to reduce in a principled way.