Train/Validation Split by Case
Why Split by Case (Not by Window)
The feature dataset has one row per window, but many windows come from the same physical case (same simulation or same run). Each case has a case_id.
If we split by row (by window), then:
- Some windows from case A could be in train and others from the same case A in validation.
- The model would effectively see “the same leak” in both sets. This is data leakage, and it overstates validation performance.
So we split by case: either all windows of a case go to train, or all go to validation. No case appears in both. That way validation measures performance on unseen cases, which is what we care about in production.
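The difference between the two strategies can be illustrated with a small, standard-library-only sketch. The toy data (6 cases with 5 windows each) and the helper name `shared_cases` are hypothetical and are not part of the platform's code; the point is only to show how a row-level split lets a case land on both sides while a case-level split does not.

```python
import random

# Hypothetical toy data: 6 cases with 5 windows each (one row per window).
case_ids = [f"case_{c}" for c in range(6) for _ in range(5)]


def shared_cases(train_rows, val_rows, case_ids):
    """Return the set of case_ids that appear on both sides of a split."""
    train_cases = {case_ids[i] for i in train_rows}
    val_cases = {case_ids[i] for i in val_rows}
    return train_cases & val_cases


rng = random.Random(0)

# Row-level split: shuffle window indices and hold out the last 20%.
rows = list(range(len(case_ids)))
rng.shuffle(rows)
cut = int(0.8 * len(rows))
row_train, row_val = rows[:cut], rows[cut:]

# Case-level split: shuffle the unique cases and hold out whole cases.
cases = sorted(set(case_ids))
rng.shuffle(cases)
held_out = set(cases[:2])
case_train = [i for i, c in enumerate(case_ids) if c not in held_out]
case_val = [i for i, c in enumerate(case_ids) if c in held_out]

leaked_by_row = shared_cases(row_train, row_val, case_ids)
leaked_by_case = shared_cases(case_train, case_val, case_ids)

print("cases on both sides (row split): ", sorted(leaked_by_row))
print("cases on both sides (case split):", sorted(leaked_by_case))
```

With this toy layout the row-level split always shares at least one case between train and validation, because the 6 held-out rows cannot be made up of whole 5-window cases. The case-level split shares none by construction.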
How It Works
- Load features and read the case_id column (if present).
- Align to the feature schema and build X, y (and optionally groups = case_id).
- Split:
  - If case_id exists: Use GroupShuffleSplit (or equivalent) with groups = case_id. Each group is one case, and all rows of that group go to the same side (train or validation).
  - If case_id is missing: Fall back to a normal train_test_split (e.g. with stratify=y for binary/multiclass to keep class balance).
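The steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the platform's actual implementation: the function name `split_train_val`, the default `test_size=0.2`, `random_state=42`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split


def split_train_val(X, y, groups=None, test_size=0.2, random_state=42):
    """Return (train_idx, val_idx). Hypothetical helper for illustration.

    If groups (case_id per row) is given, split by case so that no case
    appears on both sides. Otherwise fall back to a stratified row split.
    """
    if groups is not None:
        splitter = GroupShuffleSplit(
            n_splits=1, test_size=test_size, random_state=random_state
        )
        train_idx, val_idx = next(splitter.split(X, y, groups=groups))
        return train_idx, val_idx

    # No case_id available: ordinary row-level split, stratified on y.
    indices = np.arange(len(y))
    train_idx, val_idx = train_test_split(
        indices, test_size=test_size, random_state=random_state, stratify=y
    )
    return train_idx, val_idx


# Synthetic example: 10 cases x 8 windows, label constant within a case.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(10), 8)      # assumed case_id per window
case_label = np.array([0, 1] * 5)         # assumed leak / non-leak per case
y = case_label[groups]
X = rng.normal(size=(len(y), 4))

train_idx, val_idx = split_train_val(X, y, groups=groups)
fallback_train_idx, fallback_val_idx = split_train_val(X, y, groups=None)

print("validation cases:", sorted(set(groups[val_idx])))
```

Note that with `GroupShuffleSplit`, `test_size` is a fraction of the groups (cases), not of the rows, so the share of validation windows can differ from 20% when cases have unequal numbers of windows.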
This logic is used in:
- All training scripts (run_training_pfm_*, run_training_observer_*).
- Hyperparameter optimization (run_optimize_lgbm_hyperparameters.py), so Optuna evaluates on the same “by-case” split and the chosen hyperparameters generalize to new cases.
Practical Implications
- Stratification: When splitting by case, we try to keep a similar proportion of leak vs. non-leak cases in train and validation when the splitter supports it. GroupShuffleSplit itself does not stratify, so this depends on the group-split API in use.
- Small number of cases: If you have very few cases, validation can be noisy; consider cross-validation over cases or collecting more runs.
- Reproducibility: Use a fixed random_state in the split so that the same config and data always produce the same train/validation sets.
This design is central to trustworthy validation metrics and to the platform’s model validation and overfitting philosophy.