Training Pipelines (PFM & OBSERVER)¶

LightGBM training for detection (binary), multiclass (size, location), and regression (leak flow). Each pipeline has its own config section and script.

Scripts and config sections¶

Script	Config section	Task
`run_training_pfm_detection_pipeline.py`	`training_pfm_detection_pipeline`	Binary detection (PFM).
`run_training_pfm_size_pipeline.py`	`training_pfm_size_pipeline`	Multiclass size (PFM).
`run_training_pfm_location_pipeline.py`	`training_pfm_location_pipeline`	Multiclass location (PFM).
`run_training_pfm_leakflow_pipeline.py`	`training_pfm_leakflow_pipeline`	Regression leak flow (PFM).
`run_training_observer_detection_pipeline.py`	`training_observer_detection_pipeline`	Detection by thresholding predicted leak flow (OBSERVER); default objective is `regression`.
`run_training_observer_size_pipeline.py`	`training_observer_size_pipeline`	Size (OBSERVER); default objective is `regression`.
`run_training_observer_location_pipeline.py`	`training_observer_location_pipeline`	Location (OBSERVER); default objective is `regression`.

Example:

python scripts/run_training_pfm_detection_pipeline.py --config configs/pipelines_config.yml

Common configuration¶

Key	Description
`input_path`	Parquet file or directory of Parquet (features dataset).
`output_folder`	Directory for model, metrics, and metadata.
`features_schema_file`	JSON list of feature names, typically produced by feature selection.
`label_column`	Target column (e.g. `label`, `LEAK_FLOW_kg_s`).
`test_size`	Validation split fraction.
`random_state`	Random seed.
`shuffle`	Shuffle before split.
`lgbm_params`	LightGBM hyperparameters.
`scale_features`	Whether to scale features (optional).
`robustness`	Optional robustness/augmentation config.
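
As a rough illustration of how `test_size`, `random_state`, and `shuffle` could drive a case-level split, the sketch below holds out whole cases rather than individual rows. The `case_ids` input and the `split_by_case` helper are hypothetical names used only for this example; the actual wrappers may group and split differently.

```python
import random
from collections import defaultdict


def split_by_case(case_ids, test_size=0.2, random_state=42, shuffle=True):
    """Return (train_idx, valid_idx) so that no case appears in both sets."""
    # Group row indices by case, keeping cases in first-seen order.
    rows_by_case = defaultdict(list)
    for row, case in enumerate(case_ids):
        rows_by_case[case].append(row)

    cases = list(rows_by_case)
    if shuffle:
        # random_state=None gives a non-reproducible shuffle.
        random.Random(random_state).shuffle(cases)

    # Hold out at least one case, but never all of them.
    n_valid = max(1, round(len(cases) * test_size))
    n_valid = min(n_valid, len(cases) - 1)

    valid_idx = sorted(r for c in cases[:n_valid] for r in rows_by_case[c])
    train_idx = sorted(r for c in cases[n_valid:] for r in rows_by_case[c])
    return train_idx, valid_idx


# Hypothetical data: five cases with two rows each.
case_ids = ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4", "c5", "c5"]
train_idx, valid_idx = split_by_case(case_ids, test_size=0.2)
```

With a fixed `random_state` the same cases are held out on every run, which is what makes validation metrics comparable across retrainings.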

Detection-specific:

detection_threshold — Fixed threshold or auto.
threshold_strategy — e.g. max_f1.
min_recall — Constraint for threshold search.
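
To make the interaction between `threshold_strategy: max_f1` and `min_recall` concrete, here is a minimal sketch of a threshold search over validation scores. The `choose_threshold` function is a hypothetical helper, not the pipeline's implementation, and it assumes that a score at or above the threshold counts as a leak.

```python
def choose_threshold(y_true, scores, min_recall=None):
    """Return (threshold, f1) with the best F1, optionally under a recall floor."""
    best = None  # (f1, threshold)
    for thr in sorted(set(scores)):
        pred = [s >= thr for s in scores]
        tp = sum(1 for p, y in zip(pred, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(pred, y_true) if p and y == 0)
        fn = sum(1 for p, y in zip(pred, y_true) if not p and y == 1)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        # Discard candidates that violate the recall floor.
        if min_recall is not None and recall < min_recall:
            continue
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if best is None or f1 > best[0]:
            best = (f1, thr)
    if best is None:
        raise ValueError("no threshold satisfies min_recall")
    return best[1], best[0]


# Toy validation set (made-up values for illustration).
y_true = [0, 1, 0, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

unconstrained, _ = choose_threshold(y_true, scores)
constrained, _ = choose_threshold(y_true, scores, min_recall=0.999)
```

In this toy data the unconstrained search settles on a higher threshold that misses one leak, while the recall floor forces a lower threshold that catches every leak at the cost of more false alarms.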

Configuration reference (each item explained)¶

Item	Meaning	What it solves	Notes
`input_path`	Path to the features Parquet (file or directory).	Defines the training data; must match the schema and label columns you use.	Same dataset as feature selection / Optuna when applicable.
`features_schema_file`	Path to JSON listing feature names in the order expected by the model.	Ensures inference and training use the same columns in the same order; usually the output of feature selection.	In the current wrappers it is validated as required.
`output_folder`	Directory where the model, metrics, and metadata are written.	Centralizes artifacts for deployment and idempotency (config hash stored here).
`label_column`	Target column: `label` (detection), `LEAK_SIZE_in`, `LEAK_LOCATION_m`, or `LEAK_FLOW_kg_s`.	Defines what the model predicts; determines objective (binary, multiclass, regression).	Must exist in the features dataset.
`test_size`	Fraction of data used for validation (e.g. 0.2).	Enables early stopping and unbiased metrics; the split is done by case when possible, so that rows from the same case do not end up in both the training and validation sets (data leakage).
`random_state`	Random seed for split and training.	Reproducibility; same config + same data = same model.	Use `null` for non-reproducible runs.
`shuffle`	Whether to shuffle before splitting.	Ensures validation is representative when data is ordered.	Usually `true`.
`scale_features`	Whether to scale features before training.	LightGBM typically does not require scaling; set `false` unless you have a reason.
`detection_threshold`	Threshold applied to the predicted probability (PFM, binary) or the predicted flow (OBSERVER) to classify a sample as a leak.	Converts scores to binary decisions; `auto` selects the threshold on the validation set according to `threshold_strategy` (e.g. maximizing F1).	Only for detection pipelines.
`threshold_strategy`	How to choose threshold when `detection_threshold` is `auto` (e.g. `max_f1`).	Balances precision and recall in a single decision rule.
`min_recall`	Minimum recall constraint when searching for threshold.	Ensures a safety floor (e.g. 0.999) while optimizing other metrics.	Only for detection.
`robustness`	Optional augmentation: `enabled`, `copies`, `deadband_abs`, `quantize_decimals`, `noise_std_frac`, `feature_dropout_prob`, `random_state`.	Simulates sensor error and quantization so the model is robust to real-world noise.	Disable with `enabled: false` or omit.
`lgbm_params`	LightGBM hyperparameters (e.g. `objective`, `metric`, `learning_rate`, `num_leaves`, `max_depth`).	Controls model capacity and regularization; often filled from Optuna output.	Must match task: `binary`, `multiclass`, or `regression`.
`num_boost_round`	Maximum number of boosting rounds.	Upper bound on training length; early stopping usually stops earlier.
`early_stopping_rounds`	Stop if validation metric does not improve for this many rounds.	Prevents overfitting and saves time.
`min_improvement`	Minimum improvement to count as “better” for early stopping.	Avoids stopping on tiny fluctuations.
`verbose_eval`	Print evaluation every N rounds.	Lets you monitor training progress.
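
The relationship between `early_stopping_rounds` and `min_improvement` can be sketched in plain Python. This is an illustrative model of the stopping rule, assuming a lower-is-better validation metric; it is not LightGBM's own code. LightGBM's `early_stopping` callback exposes a comparable `min_delta` argument in recent releases, but whether the wrappers use it is not stated here.

```python
def best_round_with_early_stopping(valid_losses, early_stopping_rounds=50,
                                   min_improvement=1e-6):
    """Return the 1-based round that would be kept as the best iteration."""
    best_loss = float("inf")
    best_round = 0
    for rnd, loss in enumerate(valid_losses, start=1):
        if best_loss - loss > min_improvement:
            # Only a gain larger than min_improvement resets the patience counter.
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= early_stopping_rounds:
            # Patience exhausted: stop and keep the best round seen so far.
            break
    return best_round


# Made-up loss curve: round 4 improves on round 3 by only about 1e-7.
losses = [0.50, 0.40, 0.35, 0.3499999, 0.36, 0.37, 0.38]

strict = best_round_with_early_stopping(losses, early_stopping_rounds=3,
                                        min_improvement=1e-6)
loose = best_round_with_early_stopping(losses, early_stopping_rounds=3,
                                       min_improvement=0.0)
```

With the stricter setting the negligible gain at round 4 is ignored, so the run stops earlier and keeps round 3; with `min_improvement=0.0` round 4 counts as an improvement.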

Multiclass/regression-specific: deviation_percentage (PFM size/location) discretizes continuous labels; min_accuracy can constrain Optuna; filter_zero_flow, filter_zero_size, filter_zero_location restrict training to leak samples when needed.

Configuration template (example: PFM detection)¶

Each training pipeline has its own section in configs/pipelines_config.yml. Example for training_pfm_detection_pipeline:

training_pfm_detection_pipeline:
  input_path: "data/features/SS/data/features_dataset.parquet"
  features_schema_file: "data/features/SS/metadata/features_schema_rfe_pfm_detection.json"
  output_folder: "models/SS/PFM/lgbm"

  label_column: "label"
  test_size: 0.20
  random_state: 42
  shuffle: true
  scale_features: false

  detection_threshold: 0.5
  threshold_strategy: "max_f1"
  min_recall: 0.999

  robustness:
    enabled: true
    copies: 3
    deadband_abs:
      min: 1e-4
      max: 5e-1
    quantize_decimals:
      min: 1
      max: 4
    noise_std_frac:
      min: 0.001
      max: 0.03
    random_state: 42

  lgbm_params:
    objective: "binary"
    metric: ["binary_logloss", "auc"]
    verbosity: -1
    boosting_type: "gbdt"
    random_state: 42
    learning_rate: 0.077
    num_leaves: 96
    max_depth: 7
    # ... (see pipelines_config.yml for full optimized params)

  num_boost_round: 10000
  early_stopping_rounds: 50
  min_improvement: 1.0e-6
  verbose_eval: 10

The other training sections follow the same structure under their own section names: training_pfm_size_pipeline, training_pfm_location_pipeline, training_pfm_leakflow_pipeline, training_observer_detection_pipeline, training_observer_size_pipeline, training_observer_location_pipeline. Task-specific keys differ per section; see pipelines_config.yml in the repository for the full content of each one.

Default values applied by the training scripts¶

All seven training wrappers validate these keys as required before running:

input_path
features_schema_file
output_folder

After that, each script merges its own DEFAULT_PIPELINE_CONFIG, DEFAULT_LGBM_PARAMS, and DEFAULT_ROBUSTNESS.
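
The validate-then-merge step could look roughly like the sketch below. The constant names `DEFAULT_PIPELINE_CONFIG` and `DEFAULT_ROBUSTNESS` come from the text above, but their contents here are a small illustrative subset, and `deep_merge` and `resolve_config` are hypothetical helpers. Whether the real scripts merge nested blocks key by key, as done here, is an assumption.

```python
import copy

REQUIRED_KEYS = ("input_path", "features_schema_file", "output_folder")

# Illustrative subset only; the real defaults live in each training script.
DEFAULT_PIPELINE_CONFIG = {"label_column": "label", "test_size": 0.20, "random_state": 42}
DEFAULT_ROBUSTNESS = {
    "enabled": True,
    "copies": 3,
    "noise_std_frac": {"min": 0.001, "max": 0.03},
}


def deep_merge(defaults, overrides):
    """Recursively overlay `overrides` on a copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(user_section):
    """Check required keys, then fill everything else from the defaults."""
    missing = [k for k in REQUIRED_KEYS if not user_section.get(k)]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    defaults = dict(DEFAULT_PIPELINE_CONFIG, robustness=DEFAULT_ROBUSTNESS)
    return deep_merge(defaults, user_section)


config = resolve_config({
    "input_path": "data/features/SS/data/features_dataset.parquet",
    "features_schema_file": "data/features/SS/metadata/features_schema_rfe_pfm_detection.json",
    "output_folder": "models/SS/PFM/lgbm",
    "test_size": 0.25,
    "robustness": {"copies": 5},
})
```

Here the user overrides `test_size` and `robustness.copies`, while `label_column` and the remaining robustness settings come from the defaults.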

Top-level defaults by script¶

Config section	Core defaults	Task-specific defaults	Training loop defaults
`training_pfm_detection_pipeline`	`label_column="label"`, `test_size=0.20`, `random_state=42`, `shuffle=true`, `scale_features=false`	`detection_threshold=0.5`, `threshold_strategy="max_f1"`, `min_recall=0.999`	`num_boost_round=10000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10`
`training_pfm_size_pipeline`	`label_column="LEAK_SIZE_in"`, `test_size=0.25`, `random_state=42`, `shuffle=true`, `scale_features=false`	`deviation_percentage=5.0`, `min_accuracy=0.99`	`num_boost_round=5000`, `early_stopping_rounds=40`, `min_improvement=1e-6`, `verbose_eval=10`
`training_pfm_location_pipeline`	`label_column="LEAK_LOCATION_m"`, `test_size=0.20`, `random_state=42`, `shuffle=true`, `scale_features=false`	`deviation_percentage=5.0`, `min_accuracy=0.99`	`num_boost_round=5000`, `early_stopping_rounds=40`, `min_improvement=1e-6`, `verbose_eval=10`
`training_pfm_leakflow_pipeline`	`label_column="LEAK_FLOW_kg_s"`, `test_size=0.25`, `random_state=42`, `shuffle=true`	`filter_zero_flow=false`, `min_label=null`	`num_boost_round=5000`, `early_stopping_rounds=40`, `min_improvement=1e-6`, `verbose_eval=10`
`training_observer_detection_pipeline`	`label_column="LEAK_FLOW_kg_s"`, `test_size=0.20`, `random_state=42`, `shuffle=true`	`filter_zero_flow=false`, `min_label=null`, `detection_threshold="auto"`, `threshold_strategy="max_f1"`, `min_recall=null`, `target_transform="abs"`, `detection_label_column="label"`, `number_of_adjacent_windows_to_detect_leak=5`	`num_boost_round=5000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10`
`training_observer_size_pipeline`	`label_column="LEAK_SIZE_in"`, `test_size=0.25`, `random_state=42`, `shuffle=true`	`filter_zero_size=true`, `min_label=null`	`num_boost_round=5000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10`
`training_observer_location_pipeline`	`label_column="LEAK_LOCATION_m"`, `test_size=0.25`, `random_state=42`, `shuffle=true`	`filter_zero_location=true`, `min_label=null`	`num_boost_round=5000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10`

Default robustness blocks¶

When the robustness block is omitted, or only partly specified, every script fills it in from its defaults. The shared nested defaults are:

deadband_abs.min = 1e-4
deadband_abs.max = 5e-1
quantize_decimals.min = 1
quantize_decimals.max = 4
noise_std_abs.min = 0.0
noise_std_abs.max = 0.0
feature_dropout_prob.min = 0.0
feature_dropout_prob.max = 0.1
random_state = 42

Per script, the top-level robustness defaults are:

Config section	`enabled`	`copies`	`noise_std_frac`
`training_pfm_detection_pipeline`	`true`	`3`	`min=0.001`, `max=0.03`
`training_pfm_size_pipeline`	`false`	`0`	`min=0.001`, `max=0.02`
`training_pfm_location_pipeline`	`true`	`3`	`min=0.001`, `max=0.02`
`training_pfm_leakflow_pipeline`	`true`	`3`	`min=0.001`, `max=0.02`
`training_observer_detection_pipeline`	`true`	`3`	`min=0.001`, `max=0.02`
`training_observer_size_pipeline`	`true`	`3`	`min=0.001`, `max=0.02`
`training_observer_location_pipeline`	`true`	`3`	`min=0.001`, `max=0.02`
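
A minimal sketch of how such a block might drive augmentation follows. It covers only `copies`, `noise_std_frac`, `quantize_decimals`, and `random_state`; deadband and feature dropout are left out because their exact semantics are not described here. The `augment_rows` helper, the per-copy sampling of settings, and the noise model (Gaussian noise with a standard deviation proportional to each value's magnitude) are all assumptions made for illustration.

```python
import random


def augment_rows(rows, robustness):
    """Return the original rows plus `copies` perturbed copies of each row."""
    if not robustness.get("enabled", False) or robustness.get("copies", 0) <= 0:
        return [list(r) for r in rows]

    rng = random.Random(robustness.get("random_state", 42))
    noise = robustness["noise_std_frac"]
    quant = robustness["quantize_decimals"]

    out = [list(r) for r in rows]
    for _ in range(robustness["copies"]):
        # Draw one perturbation setting per copy from the configured ranges.
        frac = rng.uniform(noise["min"], noise["max"])
        decimals = rng.randint(quant["min"], quant["max"])
        for row in rows:
            # Assumption: Gaussian noise with std = frac * |value|.
            noisy = [x + rng.gauss(0.0, frac * abs(x)) for x in row]
            # Quantize to mimic limited sensor resolution.
            out.append([round(x, decimals) for x in noisy])
    return out


# Made-up feature rows for illustration.
rows = [[1.0, 250.0], [2.0, 300.0]]
cfg = {
    "enabled": True,
    "copies": 3,
    "noise_std_frac": {"min": 0.001, "max": 0.03},
    "quantize_decimals": {"min": 1, "max": 4},
    "random_state": 42,
}
augmented = augment_rows(rows, cfg)
```

With `copies: 3` the dataset grows to four times its original size, which is worth keeping in mind when choosing `num_boost_round` and estimating training time.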

Default `lgbm_params` families¶

If lgbm_params is omitted, each script injects its own built-in optimized block from DEFAULT_LGBM_PARAMS. At a high level:

Config section	`objective`	`metric`
`training_pfm_detection_pipeline`	`binary`	`["binary_logloss", "auc"]`
`training_pfm_size_pipeline`	`multiclass`	`["multi_logloss", "multi_error"]`
`training_pfm_location_pipeline`	`multiclass`	`["multi_error", "multi_logloss"]`
`training_pfm_leakflow_pipeline`	`regression`	`["l1", "rmse"]`
`training_observer_detection_pipeline`	`regression`	`["l1", "rmse"]`
`training_observer_size_pipeline`	`regression`	`["l1", "rmse"]`
`training_observer_location_pipeline`	`regression`	`["l1", "rmse"]`

The exact numeric defaults for learning_rate, num_leaves, max_depth, regularization, bagging, and related LightGBM knobs are defined directly in each training script.

Idempotency¶

Each script computes a config hash from relevant keys (including input_path, output_folder, features_schema_file).
If output_folder/training_metadata.json exists and its config_hash matches the current one, the script exits without training and prints a message noting that the run was skipped as idempotent.
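
The check can be pictured with the following sketch, which assumes a local output folder and a SHA-256 hash over a canonical JSON dump of the config. The helper names, the hashing scheme, and the choice of which keys go into the hash are assumptions; only the `training_metadata.json` file name and its `config_hash` field come from the description above.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config):
    """Hash a config dict deterministically (key order does not matter)."""
    payload = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_skip_training(config, output_folder):
    """True when training_metadata.json already records this exact config."""
    metadata_path = Path(output_folder) / "training_metadata.json"
    if not metadata_path.exists():
        return False
    stored = json.loads(metadata_path.read_text(encoding="utf-8"))
    return stored.get("config_hash") == config_hash(config)


example_config = {
    "input_path": "data/features/SS/data/features_dataset.parquet",
    "features_schema_file": "data/features/SS/metadata/features_schema_rfe_pfm_detection.json",
    "output_folder": "models/SS/PFM/lgbm",
    "test_size": 0.20,
}
example_hash = config_hash(example_config)
```

Because the hash covers the config rather than the data itself, replacing the contents of `input_path` without changing the path would not trigger retraining under this scheme; deleting `training_metadata.json` forces a fresh run.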

Outputs¶

model_lgbm.txt — LightGBM model.
metrics.json — Validation (and optionally train) metrics.
training_metadata.json — Config hash, paths, timestamp.
Detection pipelines: detection_metrics.json, confusion_matrix.json, and optional deployment bundle (e.g. lgbm_inference_config.json, features_schema.json).
Multiclass: Label mapping and optional confusion matrix.

S3¶

Full S3 support: input_path, output_folder, and features_schema_file can all point to S3. The trainer uses mlops.storage for listing, reading, and writing. This requires s3fs, which is installed through the s3 extra (pip install ".[s3]").