Training Pipelines (PFM & OBSERVER)

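LightGBM training for binary detection, multiclass size and location, and leak-flow regression. Each pipeline has its own script and its own config section. The OBSERVER pipelines default to a regression objective (see "Default lgbm_params families"); OBSERVER detection thresholds a predicted flow rather than a probability.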


Scripts and config sections

| Script | Config section | Task |
| --- | --- | --- |
| run_training_pfm_detection_pipeline.py | training_pfm_detection_pipeline | Binary detection (PFM). |
| run_training_pfm_size_pipeline.py | training_pfm_size_pipeline | Multiclass size (PFM). |
| run_training_pfm_location_pipeline.py | training_pfm_location_pipeline | Multiclass location (PFM). |
| run_training_pfm_leakflow_pipeline.py | training_pfm_leakflow_pipeline | Regression leak flow (PFM). |
| run_training_observer_detection_pipeline.py | training_observer_detection_pipeline | Binary detection (OBSERVER). |
| run_training_observer_size_pipeline.py | training_observer_size_pipeline | Multiclass size (OBSERVER). |
| run_training_observer_location_pipeline.py | training_observer_location_pipeline | Multiclass location (OBSERVER). |

Example:

python scripts/run_training_pfm_detection_pipeline.py --config configs/pipelines_config.yml

Common configuration

| Key | Description |
| --- | --- |
| input_path | Parquet file or directory of Parquet (features dataset). |
| output_folder | Directory for model, metrics, and metadata. |
| features_schema_file | JSON list of feature names, typically produced by feature selection. |
| label_column | Target column (e.g. label, LEAK_FLOW_kg_s). |
| test_size | Validation split fraction. |
| random_state | Random seed. |
| shuffle | Shuffle before split. |
| lgbm_params | LightGBM hyperparameters. |
| scale_features | Whether to scale features (optional). |
| robustness | Optional robustness/augmentation config. |

Detection-specific:

  • detection_threshold — Fixed threshold or auto.
  • threshold_strategy — e.g. max_f1.
  • min_recall — Constraint for threshold search.
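
The three detection keys can be read together as a small search over candidate thresholds. The sketch below is a minimal pure-Python illustration of a max_f1 strategy with an optional min_recall floor. The function name select_threshold, the use of the observed scores as candidate thresholds, and the "score >= threshold means leak" rule are assumptions made for illustration, not the scripts' actual implementation.

```python
from typing import Optional, Sequence, Tuple


def select_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    min_recall: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (threshold, f1) with the highest F1 among thresholds whose
    recall is at least min_recall. A sample counts as a predicted leak
    when its score is >= threshold. Hypothetical helper for illustration.
    """
    positives = sum(1 for y in y_true if y == 1)
    if positives == 0:
        raise ValueError("at least one positive sample is required")

    best: Optional[Tuple[float, float]] = None
    # Every distinct observed score is tried as a candidate threshold.
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y != 1 and s >= threshold)
        fn = positives - tp
        recall = tp / positives
        if min_recall is not None and recall < min_recall:
            continue  # violates the recall floor
        f1 = 2 * tp / (2 * tp + fp + fn)
        # Strict comparison: ties keep the lower threshold.
        if best is None or f1 > best[1]:
            best = (threshold, f1)

    if best is None:
        raise ValueError("no threshold satisfies min_recall")
    return best
```

With a recall floor close to 1.0, the search tends to settle on a lower threshold than the unconstrained F1 optimum, trading some precision for fewer missed leaks.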

Configuration reference (each item explained)

| Item | Meaning | What it solves | Notes |
| --- | --- | --- | --- |
| input_path | Path to the features Parquet (file or directory). | Defines the training data; must match the schema and label columns you use. | Same dataset as feature selection / Optuna when applicable. |
| features_schema_file | Path to JSON listing feature names in the order expected by the model. | Ensures inference and training use the same columns in the same order; usually the output of feature selection. | In the current wrappers it is validated as required. |
| output_folder | Directory where the model, metrics, and metadata are written. | Centralizes artifacts for deployment and idempotency (config hash stored here). | |
| label_column | Target column: label (detection), LEAK_SIZE_in, LEAK_LOCATION_m, or LEAK_FLOW_kg_s. | Defines what the model predicts; determines objective (binary, multiclass, regression). | Must exist in the features dataset. |
| test_size | Fraction of data used for validation (e.g. 0.2). | Enables early stopping and unbiased metrics; split is by case when possible to avoid leakage. | |
| random_state | Random seed for split and training. | Reproducibility; same config + same data = same model. | Use null for non-reproducible runs. |
| shuffle | Whether to shuffle before splitting. | Ensures validation is representative when data is ordered. | Usually true. |
| scale_features | Whether to scale features before training. | | LightGBM typically does not require scaling; set false unless you have a reason. |
| detection_threshold | Threshold on predicted probability (binary) or flow (OBSERVER) for “leak”. | Converts scores to binary decisions; auto picks a value that maximizes F1 on validation. | Only for detection pipelines. |
| threshold_strategy | How to choose threshold when detection_threshold is auto (e.g. max_f1). | Balances precision and recall in a single decision rule. | |
| min_recall | Minimum recall constraint when searching for threshold. | Ensures a safety floor (e.g. 0.999) while optimizing other metrics. | Only for detection. |
| robustness | Optional augmentation: enabled, copies, deadband_abs, quantize_decimals, noise_std_frac, feature_dropout_prob, random_state. | Simulates sensor error and quantization so the model is robust to real-world noise. | Disable with enabled: false. Omitting the block applies the script's default robustness block, which is enabled for every pipeline except training_pfm_size_pipeline. |
| lgbm_params | LightGBM hyperparameters (e.g. objective, metric, learning_rate, num_leaves, max_depth). | Controls model capacity and regularization; often filled from Optuna output. | Must match task: binary, multiclass, or regression. |
| num_boost_round | Maximum number of boosting rounds. | Upper bound on training length; early stopping usually stops earlier. | |
| early_stopping_rounds | Stop if validation metric does not improve for this many rounds. | Prevents overfitting and saves time. | |
| min_improvement | Minimum improvement to count as “better” for early stopping. | Keeps tiny fluctuations from resetting the early-stopping counter. | |
| verbose_eval | Print evaluation every N rounds. | Lets you monitor training progress. | |
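
The way num_boost_round, early_stopping_rounds, and min_improvement interact can be sketched in plain Python. This is a minimal illustration of the rule as described in the table, assuming a lower-is-better validation metric. best_round is a hypothetical helper, and how the scripts hand these values to LightGBM is not shown on this page.

```python
from typing import Sequence


def best_round(
    val_losses: Sequence[float],
    num_boost_round: int,
    early_stopping_rounds: int,
    min_improvement: float = 0.0,
) -> int:
    """Return the 1-based round holding the best validation loss when
    training stops. Lower loss is better. Hypothetical helper.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses[:num_boost_round], start=1):
        if best_loss - loss > min_improvement:
            # Improvement is large enough: record it and reset patience.
            best_loss, best_iter = loss, i
        elif i - best_iter >= early_stopping_rounds:
            # No sufficient improvement for early_stopping_rounds rounds.
            break
    return best_iter
```

For a loss history of 0.50, 0.40, 0.35, 0.3499999, 0.36, 0.37, 0.30 with early_stopping_rounds=3, a min_improvement of 1e-6 ignores the tiny gain at round 4 and stops with round 3 as the best. With min_improvement=0.0 that gain resets the counter, and training continues to round 7.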

Multiclass/regression-specific: deviation_percentage (PFM size/location) discretizes continuous labels; min_accuracy can constrain Optuna; filter_zero_flow, filter_zero_size, filter_zero_location restrict training to leak samples when needed.
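
The reference table notes that the validation split is made by case when possible, so that rows from one case never land on both sides. The sketch below shows one way such a split could use test_size, random_state, and shuffle. It assumes each row carries a case identifier, and both that identifier and the helper split_by_case are hypothetical names, not taken from the scripts.

```python
import random
from typing import Hashable, List, Optional, Sequence, Tuple


def split_by_case(
    case_ids: Sequence[Hashable],
    test_size: float,
    random_state: Optional[int] = None,
    shuffle: bool = True,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, validation_indices) such that every case
    falls entirely on one side of the split. Hypothetical helper.
    """
    # Distinct cases in first-seen order, so results are deterministic.
    cases = list(dict.fromkeys(case_ids))
    if len(cases) < 2:
        raise ValueError("need at least two cases to split")
    if shuffle:
        # random_state=None seeds from the OS, i.e. a non-reproducible run.
        random.Random(random_state).shuffle(cases)

    # At least one validation case, and at least one training case.
    n_val = min(max(1, round(len(cases) * test_size)), len(cases) - 1)
    val_cases = set(cases[:n_val])

    train_idx = [i for i, c in enumerate(case_ids) if c not in val_cases]
    val_idx = [i for i, c in enumerate(case_ids) if c in val_cases]
    return train_idx, val_idx
```

Note that test_size here is a fraction of cases, not of rows, so the row-level fraction can differ when cases have unequal lengths.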


Configuration template (example: PFM detection)

Each training pipeline has its own section in configs/pipelines_config.yml. Example for training_pfm_detection_pipeline:

training_pfm_detection_pipeline:
  input_path: "data/features/SS/data/features_dataset.parquet"
  features_schema_file: "data/features/SS/metadata/features_schema_rfe_pfm_detection.json"
  output_folder: "models/SS/PFM/lgbm"

  label_column: "label"
  test_size: 0.20
  random_state: 42
  shuffle: true
  scale_features: false

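  # threshold_strategy and min_recall are used when detection_threshold is "auto";
  # with a fixed value such as 0.5, no threshold search is needed.
  detection_threshold: 0.5
  threshold_strategy: "max_f1"
  min_recall: 0.999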

  robustness:
    enabled: true
    copies: 3
    deadband_abs:
      # Decimal point kept in the mantissa: YAML 1.1 loaders such as PyYAML read "1e-4" as a string.
      min: 1.0e-4
      max: 5.0e-1
    quantize_decimals:
      min: 1
      max: 4
    noise_std_frac:
      min: 0.001
      max: 0.03
    random_state: 42

  lgbm_params:
    objective: "binary"
    metric: ["binary_logloss", "auc"]
    verbosity: -1
    boosting_type: "gbdt"
    random_state: 42
    learning_rate: 0.077
    num_leaves: 96
    max_depth: 7
    # ... (see pipelines_config.yml for full optimized params)

  num_boost_round: 10000
  early_stopping_rounds: 50
  min_improvement: 1.0e-6
  verbose_eval: 10

The other training sections follow the same structure, with their own task-specific keys in place of the detection ones: training_pfm_size_pipeline, training_pfm_location_pipeline, training_pfm_leakflow_pipeline, training_observer_detection_pipeline, training_observer_size_pipeline, training_observer_location_pipeline. See pipelines_config.yml in the repository for the full content of each section.


Default values applied by the training scripts

All seven training wrappers validate these keys as required before running:

  • input_path
  • features_schema_file
  • output_folder

After that, each script merges its own DEFAULT_PIPELINE_CONFIG, DEFAULT_LGBM_PARAMS, and DEFAULT_ROBUSTNESS.
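
A minimal sketch of this validate-then-merge step follows. The names deep_merge and resolve_config are hypothetical, and the recursive merge of nested blocks is an assumption: this page does not say whether nested keys such as robustness are merged key by key or replaced as a whole.

```python
import copy
from typing import Any, Dict, Mapping

# Keys that the wrappers validate as required, per the list above.
REQUIRED_KEYS = ("input_path", "features_schema_file", "output_folder")


def deep_merge(defaults: Mapping[str, Any], overrides: Mapping[str, Any]) -> Dict[str, Any]:
    """Return defaults updated with overrides, merging nested mappings
    recursively (assumed behavior). Neither input is modified.
    """
    merged = copy.deepcopy(dict(defaults))
    for key, value in overrides.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(user_section: Mapping[str, Any], defaults: Mapping[str, Any]) -> Dict[str, Any]:
    """Validate the required keys, then fill in defaults for the rest."""
    missing = [key for key in REQUIRED_KEYS if not user_section.get(key)]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")
    return deep_merge(defaults, user_section)
```

Under this assumption, a section that sets only robustness.copies keeps the remaining robustness defaults, while a missing input_path fails before any training starts.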

Top-level defaults by script

| Config section | Core defaults | Task-specific defaults | Training loop defaults |
| --- | --- | --- | --- |
| training_pfm_detection_pipeline | label_column="label", test_size=0.20, random_state=42, shuffle=true, scale_features=false | detection_threshold=0.5, threshold_strategy="max_f1", min_recall=0.999 | num_boost_round=10000, early_stopping_rounds=50, min_improvement=1e-6, verbose_eval=10 |
| training_pfm_size_pipeline | label_column="LEAK_SIZE_in", test_size=0.25, random_state=42, shuffle=true, scale_features=false | deviation_percentage=5.0, min_accuracy=0.99 | num_boost_round=5000, early_stopping_rounds=40, min_improvement=1e-6, verbose_eval=10 |
| training_pfm_location_pipeline | label_column="LEAK_LOCATION_m", test_size=0.20, random_state=42, shuffle=true, scale_features=false | deviation_percentage=5.0, min_accuracy=0.99 | num_boost_round=5000, early_stopping_rounds=40, min_improvement=1e-6, verbose_eval=10 |
| training_pfm_leakflow_pipeline | label_column="LEAK_FLOW_kg_s", test_size=0.25, random_state=42, shuffle=true | filter_zero_flow=false, min_label=null | num_boost_round=5000, early_stopping_rounds=40, min_improvement=1e-6, verbose_eval=10 |
| training_observer_detection_pipeline | label_column="LEAK_FLOW_kg_s", test_size=0.20, random_state=42, shuffle=true | filter_zero_flow=false, min_label=null, detection_threshold="auto", threshold_strategy="max_f1", min_recall=null, target_transform="abs", detection_label_column="label", number_of_adjacent_windows_to_detect_leak=5 | num_boost_round=5000, early_stopping_rounds=50, min_improvement=1e-6, verbose_eval=10 |
| training_observer_size_pipeline | label_column="LEAK_SIZE_in", test_size=0.25, random_state=42, shuffle=true | filter_zero_size=true, min_label=null | num_boost_round=5000, early_stopping_rounds=50, min_improvement=1e-6, verbose_eval=10 |
| training_observer_location_pipeline | label_column="LEAK_LOCATION_m", test_size=0.25, random_state=42, shuffle=true | filter_zero_location=true, min_label=null | num_boost_round=5000, early_stopping_rounds=50, min_improvement=1e-6, verbose_eval=10 |

Default robustness blocks

When the robustness block is omitted, every script merges in a default one. The nested defaults shared by all scripts are:

  • deadband_abs.min = 1e-4
  • deadband_abs.max = 5e-1
  • quantize_decimals.min = 1
  • quantize_decimals.max = 4
  • noise_std_abs.min = 0.0
  • noise_std_abs.max = 0.0
  • feature_dropout_prob.min = 0.0
  • feature_dropout_prob.max = 0.1
  • random_state = 42

Per script, the top-level robustness defaults are:

| Config section | enabled | copies | noise_std_frac |
| --- | --- | --- | --- |
| training_pfm_detection_pipeline | true | 3 | min=0.001, max=0.03 |
| training_pfm_size_pipeline | false | 0 | min=0.001, max=0.02 |
| training_pfm_location_pipeline | true | 3 | min=0.001, max=0.02 |
| training_pfm_leakflow_pipeline | true | 3 | min=0.001, max=0.02 |
| training_observer_detection_pipeline | true | 3 | min=0.001, max=0.02 |
| training_observer_size_pipeline | true | 3 | min=0.001, max=0.02 |
| training_observer_location_pipeline | true | 3 | min=0.001, max=0.02 |
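
To make the min/max ranges more concrete, here is a rough sketch of how a robustness block of this shape could drive augmentation. Everything about the semantics is an assumption: that each range is sampled uniformly once per augmented row, that noise_std_frac scales Gaussian noise by the absolute feature value, that quantize_decimals rounds to a sampled number of decimals, and that feature dropout replaces a value with NaN. deadband_abs and noise_std_abs are left out because their behavior cannot be inferred from this page. augment and augment_row are hypothetical names.

```python
import math
import random
from typing import Any, List, Mapping, Sequence


def augment_row(row: Sequence[float], cfg: Mapping[str, Any], rng: random.Random) -> List[float]:
    """Return one perturbed copy of row (assumed semantics, see lead-in)."""
    frac = rng.uniform(cfg["noise_std_frac"]["min"], cfg["noise_std_frac"]["max"])
    decimals = rng.randint(cfg["quantize_decimals"]["min"], cfg["quantize_decimals"]["max"])
    p_drop = rng.uniform(cfg["feature_dropout_prob"]["min"], cfg["feature_dropout_prob"]["max"])

    out: List[float] = []
    for value in row:
        if rng.random() < p_drop:
            out.append(math.nan)  # dropped feature, treated as missing
            continue
        noisy = value + rng.gauss(0.0, frac * abs(value))  # relative sensor noise
        out.append(round(noisy, decimals))  # limited sensor resolution
    return out


def augment(rows: Sequence[Sequence[float]], cfg: Mapping[str, Any]) -> List[List[float]]:
    """Return the original rows followed by cfg['copies'] perturbed copies of each."""
    result = [list(row) for row in rows]
    if not cfg.get("enabled", False):
        return result
    rng = random.Random(cfg.get("random_state"))
    for _ in range(cfg.get("copies", 0)):
        result.extend(augment_row(row, cfg, rng) for row in rows)
    return result
```

Under these assumptions, copies=3 quadruples the number of training rows: the originals plus three perturbed copies of each.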

Default lgbm_params families

If lgbm_params is omitted, each script injects its own built-in optimized block from DEFAULT_LGBM_PARAMS. At a high level:

| Config section | objective | metric |
| --- | --- | --- |
| training_pfm_detection_pipeline | binary | ["binary_logloss", "auc"] |
| training_pfm_size_pipeline | multiclass | ["multi_logloss", "multi_error"] |
| training_pfm_location_pipeline | multiclass | ["multi_error", "multi_logloss"] |
| training_pfm_leakflow_pipeline | regression | ["l1", "rmse"] |
| training_observer_detection_pipeline | regression | ["l1", "rmse"] |
| training_observer_size_pipeline | regression | ["l1", "rmse"] |
| training_observer_location_pipeline | regression | ["l1", "rmse"] |

The exact numeric defaults for learning_rate, num_leaves, max_depth, regularization, bagging, and related LightGBM knobs are defined directly in each training script.


Idempotency

  • Each script computes a config hash from relevant keys (including input_path, output_folder, features_schema_file).
  • If output_folder/training_metadata.json exists and its config_hash matches, the script exits without training and prints a message saying the run was skipped as already done.
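
The check described above can be sketched with the standard library. This is an illustration under stated assumptions, not the scripts' code: the hash algorithm (SHA-256 over key-sorted JSON), the helper names config_hash and should_skip, and hashing the whole config rather than a selected subset of keys are all hypothetical. The sketch also handles local paths only, whereas the trainer goes through mlops.storage.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Mapping


def config_hash(config: Mapping[str, Any]) -> str:
    """Return a stable hex digest of config (assumed: SHA-256 of sorted JSON)."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_skip(config: Mapping[str, Any], output_folder: str) -> bool:
    """Return True when training_metadata.json already records this config."""
    metadata_path = Path(output_folder) / "training_metadata.json"
    if not metadata_path.exists():
        return False
    try:
        stored = json.loads(metadata_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable metadata: retrain rather than trust it
    return isinstance(stored, dict) and stored.get("config_hash") == config_hash(config)
```

Sorting the keys before hashing means that reordering entries in the YAML file does not trigger a retrain, while changing any value does.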

Outputs

  • model_lgbm.txt — LightGBM model.
  • metrics.json — Validation (and optionally train) metrics.
  • training_metadata.json — Config hash, paths, timestamp.
  • Detection pipelines: detection_metrics.json, confusion_matrix.json, and optional deployment bundle (e.g. lgbm_inference_config.json, features_schema.json).
  • Multiclass: Label mapping and optional confusion matrix.

S3

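  • Full S3 support: input_path, output_folder, and features_schema_file can all be S3 paths. The trainer uses mlops.storage for listing, reading, and writing. This requires s3fs, installed through the s3 extra: pip install ".[s3]" (quoted so that shells such as zsh do not expand the brackets).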