Training Pipelines (PFM & OBSERVER)
LightGBM training for detection (binary), multiclass (size, location), and regression (leak flow). Each pipeline has its own config section and script.
Scripts and config sections

| Script | Config section | Task |
|---|---|---|
| `run_training_pfm_detection_pipeline.py` | `training_pfm_detection_pipeline` | Binary detection (PFM). |
| `run_training_pfm_size_pipeline.py` | `training_pfm_size_pipeline` | Multiclass size (PFM). |
| `run_training_pfm_location_pipeline.py` | `training_pfm_location_pipeline` | Multiclass location (PFM). |
| `run_training_pfm_leakflow_pipeline.py` | `training_pfm_leakflow_pipeline` | Regression leak flow (PFM). |
| `run_training_observer_detection_pipeline.py` | `training_observer_detection_pipeline` | Binary detection (OBSERVER). |
| `run_training_observer_size_pipeline.py` | `training_observer_size_pipeline` | Multiclass size (OBSERVER). |
| `run_training_observer_location_pipeline.py` | `training_observer_location_pipeline` | Multiclass location (OBSERVER). |

Example:
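As a minimal sketch of the naming convention visible in the table above, each script name is `run_` plus the config section name plus `.py`. The helper `script_for_section` and the `SECTIONS` list are hypothetical illustrations, not part of the repository; how the scripts are actually invoked (arguments, working directory) is not shown here.

```python
# Hypothetical helper illustrating the script/section naming convention.
SECTIONS = [
    "training_pfm_detection_pipeline",
    "training_pfm_size_pipeline",
    "training_pfm_location_pipeline",
    "training_pfm_leakflow_pipeline",
    "training_observer_detection_pipeline",
    "training_observer_size_pipeline",
    "training_observer_location_pipeline",
]


def script_for_section(section: str) -> str:
    """Return the script file name paired with a config section."""
    if section not in SECTIONS:
        raise KeyError(f"unknown training section: {section}")
    return f"run_{section}.py"


for section in SECTIONS:
    print(f"{section} -> {script_for_section(section)}")
```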
Common configuration

| Key | Description |
|---|---|
| `input_path` | Parquet file or directory of Parquet (features dataset). |
| `output_folder` | Directory for model, metrics, and metadata. |
| `features_schema_file` | JSON list of feature names, typically produced by feature selection. |
| `label_column` | Target column (e.g. `label`, `LEAK_FLOW_kg_s`). |
| `test_size` | Validation split fraction. |
| `random_state` | Random seed. |
| `shuffle` | Shuffle before split. |
| `lgbm_params` | LightGBM hyperparameters. |
| `scale_features` | Whether to scale features (optional). |
| `robustness` | Optional robustness/augmentation config. |

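To illustrate why `features_schema_file` matters, the sketch below reads a JSON list of feature names and uses it to pull feature values from a record in schema order. It assumes the file is a flat JSON array of strings, as the description above suggests; the function names `load_feature_names` and `select_features` are hypothetical and not taken from the training scripts.

```python
import json
from pathlib import Path


def load_feature_names(schema_path):
    """Read a JSON list of feature names and reject any other shape."""
    names = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("features schema must be a JSON list of strings")
    return names


def select_features(row, feature_names):
    """Return feature values in schema order; fail loudly on missing columns."""
    missing = [n for n in feature_names if n not in row]
    if missing:
        raise KeyError(f"columns missing from dataset: {missing}")
    return [row[n] for n in feature_names]
```

Keeping the order fixed by the schema file, rather than by the dataset's column order, is what lets training and inference agree on the feature layout.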
Detection-specific:

- `detection_threshold`: Fixed threshold or `auto`.
- `threshold_strategy`: e.g. `max_f1`.
- `min_recall`: Constraint for threshold search.
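The interaction of `detection_threshold: auto`, `threshold_strategy: max_f1`, and `min_recall` can be sketched as a search over candidate thresholds. This is an illustrative implementation under stated assumptions, not the repository's code: it uses every distinct validation score as a candidate, treats `score >= threshold` as "leak", and the function name `choose_threshold` is hypothetical.

```python
def choose_threshold(scores, labels, min_recall=None):
    """Return (threshold, f1) with the best F1, optionally requiring recall >= min_recall."""
    best = None  # (f1, threshold)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if min_recall is not None and recall < min_recall:
            continue  # candidate violates the recall floor
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if best is None or f1 > best[0]:
            best = (f1, t)
    if best is None:
        raise ValueError("no threshold satisfies the recall constraint")
    return best[1], best[0]


scores = [0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
labels = [1, 0, 0, 1, 1, 1]
print(choose_threshold(scores, labels))                    # unconstrained search
print(choose_threshold(scores, labels, min_recall=0.999))  # recall floor applied
```

In this toy data the recall floor forces a lower threshold than the unconstrained F1 optimum, which is the trade-off `min_recall` is meant to control.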
Configuration reference (each item explained)

| Item | Meaning | What it solves | Notes |
|---|---|---|---|
| `input_path` | Path to the features Parquet (file or directory). | Defines the training data; must match the schema and label columns you use. | Same dataset as feature selection / Optuna when applicable. |
| `features_schema_file` | Path to JSON listing feature names in the order expected by the model. | Ensures inference and training use the same columns in the same order; usually the output of feature selection. | In the current wrappers it is validated as required. |
| `output_folder` | Directory where the model, metrics, and metadata are written. | Centralizes artifacts for deployment and idempotency (config hash stored here). | |
| `label_column` | Target column: `label` (detection), `LEAK_SIZE_in`, `LEAK_LOCATION_m`, or `LEAK_FLOW_kg_s`. | Defines what the model predicts; determines objective (binary, multiclass, regression). | Must exist in the features dataset. |
| `test_size` | Fraction of data used for validation (e.g. 0.2). | Enables early stopping and unbiased metrics; split is by case when possible to avoid leakage. | |
| `random_state` | Random seed for split and training. | Reproducibility; same config + same data = same model. | Use `null` for non-reproducible runs. |
| `shuffle` | Whether to shuffle before splitting. | Ensures validation is representative when data is ordered. | Usually `true`. |
| `scale_features` | Whether to scale features before training. | LightGBM typically does not require scaling; set `false` unless you have a reason. | |
| `detection_threshold` | Threshold on predicted probability (binary) or flow (OBSERVER) for "leak". | Converts scores to binary decisions; `auto` picks a value that maximizes F1 on validation. | Only for detection pipelines. |
| `threshold_strategy` | How to choose the threshold when `detection_threshold` is `auto` (e.g. `max_f1`). | Balances precision and recall in a single decision rule. | |
| `min_recall` | Minimum recall constraint when searching for the threshold. | Ensures a safety floor (e.g. 0.999) while optimizing other metrics. | Only for detection. |
| `robustness` | Optional augmentation: `enabled`, `copies`, `deadband_abs`, `quantize_decimals`, `noise_std_frac`, `feature_dropout_prob`, `random_state`. | Simulates sensor error and quantization so the model is robust to real-world noise. | Disable with `enabled: false`. If the block is omitted, the script's default robustness block is merged in, so omitting it does not necessarily disable augmentation. |
| `lgbm_params` | LightGBM hyperparameters (e.g. `objective`, `metric`, `learning_rate`, `num_leaves`, `max_depth`). | Controls model capacity and regularization; often filled from Optuna output. | Must match task: binary, multiclass, or regression. |
| `num_boost_round` | Maximum number of boosting rounds. | Upper bound on training length; early stopping usually stops earlier. | |
| `early_stopping_rounds` | Stop if the validation metric does not improve for this many rounds. | Prevents overfitting and saves time. | |
| `min_improvement` | Minimum improvement to count as "better" for early stopping. | Avoids stopping on tiny fluctuations. | |
| `verbose_eval` | Print evaluation every N rounds. | Lets you monitor training progress. | |

Multiclass/regression-specific: deviation_percentage (PFM size/location) discretizes continuous labels; min_accuracy can constrain Optuna; filter_zero_flow, filter_zero_size, filter_zero_location restrict training to leak samples when needed.
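How `early_stopping_rounds` and `min_improvement` interact can be illustrated with a small patience loop over a lower-is-better validation metric. This is a sketch of the general technique, assuming an improvement only counts when it exceeds `min_improvement`; the function name `best_round` is hypothetical, and the training scripts may wire these keys into LightGBM differently.

```python
def best_round(val_losses, early_stopping_rounds, min_improvement=0.0):
    """Return (best_round, stopped_round) for a lower-is-better metric."""
    best_loss = float("inf")
    best_idx = -1
    for i, loss in enumerate(val_losses):
        if best_loss - loss > min_improvement:
            # Improvement is large enough to reset the patience counter.
            best_loss, best_idx = loss, i
        elif i - best_idx >= early_stopping_rounds:
            # No qualifying improvement for `early_stopping_rounds` rounds.
            return best_idx, i
    return best_idx, len(val_losses) - 1


losses = [0.50, 0.40, 0.39, 0.3899999, 0.3899998, 0.3899997]
print(best_round(losses, early_stopping_rounds=3, min_improvement=1e-6))
print(best_round(losses, early_stopping_rounds=3, min_improvement=0.0))
```

With `min_improvement=1e-6` the tiny gains after round 2 are ignored and training stops; with `0.0` every gain resets the counter and the loop runs to the end.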
Configuration template (example: PFM detection)
Each training pipeline has its own section in configs/pipelines_config.yml. Example for training_pfm_detection_pipeline:

    training_pfm_detection_pipeline:
      input_path: "data/features/SS/data/features_dataset.parquet"
      features_schema_file: "data/features/SS/metadata/features_schema_rfe_pfm_detection.json"
      output_folder: "models/SS/PFM/lgbm"
      label_column: "label"
      test_size: 0.20
      random_state: 42
      shuffle: true
      scale_features: false
      detection_threshold: 0.5
      threshold_strategy: "max_f1"
      min_recall: 0.999
      robustness:
        enabled: true
        copies: 3
        deadband_abs:
          min: 1e-4
          max: 5e-1
        quantize_decimals:
          min: 1
          max: 4
        noise_std_frac:
          min: 0.001
          max: 0.03
        random_state: 42
      lgbm_params:
        objective: "binary"
        metric: ["binary_logloss", "auc"]
        verbosity: -1
        boosting_type: "gbdt"
        random_state: 42
        learning_rate: 0.077
        num_leaves: 96
        max_depth: 7
        # ... (see pipelines_config.yml for full optimized params)
      num_boost_round: 10000
      early_stopping_rounds: 50
      min_improvement: 1.0e-6
      verbose_eval: 10

Other training sections follow the same structure with different keys: training_pfm_size_pipeline, training_pfm_location_pipeline, training_pfm_leakflow_pipeline, training_observer_detection_pipeline, training_observer_size_pipeline, training_observer_location_pipeline. See the full file pipelines_config.yml in the repository for each section.
Default values applied by the training scripts

All seven training wrappers validate these keys as required before running:

- `input_path`
- `features_schema_file`
- `output_folder`

After that, each script merges its own DEFAULT_PIPELINE_CONFIG, DEFAULT_LGBM_PARAMS,
and DEFAULT_ROBUSTNESS.
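The validate-then-merge behaviour described above can be sketched as follows. The constant names mirror those mentioned in the text and the values are the PFM detection defaults documented on this page, but the functions `deep_merge` and `resolve_config` are hypothetical, and the exact merge semantics of the real scripts (for example, whether nested blocks are merged key by key) are an assumption.

```python
import copy

REQUIRED_KEYS = ("input_path", "features_schema_file", "output_folder")

# Values copied from the PFM detection defaults documented on this page.
DEFAULT_PIPELINE_CONFIG = {
    "label_column": "label",
    "test_size": 0.20,
    "random_state": 42,
    "shuffle": True,
    "scale_features": False,
}
DEFAULT_ROBUSTNESS = {
    "enabled": True,
    "copies": 3,
    "noise_std_frac": {"min": 0.001, "max": 0.03},
}


def deep_merge(defaults, overrides):
    """Recursively overlay `overrides` on a copy of `defaults`."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_config(user_cfg):
    """Check required keys, then fill in defaults for everything else."""
    missing = [k for k in REQUIRED_KEYS if not user_cfg.get(k)]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    defaults = dict(DEFAULT_PIPELINE_CONFIG, robustness=DEFAULT_ROBUSTNESS)
    return deep_merge(defaults, user_cfg)
```

For example, a config that sets only the three required keys plus `robustness: {copies: 1}` would, under this sketch, keep the default `noise_std_frac` range while overriding `copies`.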
Top-level defaults by script

| Config section | Core defaults | Task-specific defaults | Training loop defaults |
|---|---|---|---|
| `training_pfm_detection_pipeline` | `label_column="label"`, `test_size=0.20`, `random_state=42`, `shuffle=true`, `scale_features=false` | `detection_threshold=0.5`, `threshold_strategy="max_f1"`, `min_recall=0.999` | `num_boost_round=10000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10` |
| `training_pfm_size_pipeline` | `label_column="LEAK_SIZE_in"`, `test_size=0.25`, `random_state=42`, `shuffle=true`, `scale_features=false` | `deviation_percentage=5.0`, `min_accuracy=0.99` | `num_boost_round=5000`, `early_stopping_rounds=40`, `min_improvement=1e-6`, `verbose_eval=10` |
| `training_pfm_location_pipeline` | `label_column="LEAK_LOCATION_m"`, `test_size=0.20`, `random_state=42`, `shuffle=true`, `scale_features=false` | `deviation_percentage=5.0`, `min_accuracy=0.99` | `num_boost_round=5000`, `early_stopping_rounds=40`, `min_improvement=1e-6`, `verbose_eval=10` |
| `training_pfm_leakflow_pipeline` | `label_column="LEAK_FLOW_kg_s"`, `test_size=0.25`, `random_state=42`, `shuffle=true` | `filter_zero_flow=false`, `min_label=null` | `num_boost_round=5000`, `early_stopping_rounds=40`, `min_improvement=1e-6`, `verbose_eval=10` |
| `training_observer_detection_pipeline` | `label_column="LEAK_FLOW_kg_s"`, `test_size=0.20`, `random_state=42`, `shuffle=true` | `filter_zero_flow=false`, `min_label=null`, `detection_threshold="auto"`, `threshold_strategy="max_f1"`, `min_recall=null`, `target_transform="abs"`, `detection_label_column="label"`, `number_of_adjacent_windows_to_detect_leak=5` | `num_boost_round=5000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10` |
| `training_observer_size_pipeline` | `label_column="LEAK_SIZE_in"`, `test_size=0.25`, `random_state=42`, `shuffle=true` | `filter_zero_size=true`, `min_label=null` | `num_boost_round=5000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10` |
| `training_observer_location_pipeline` | `label_column="LEAK_LOCATION_m"`, `test_size=0.25`, `random_state=42`, `shuffle=true` | `filter_zero_location=true`, `min_label=null` | `num_boost_round=5000`, `early_stopping_rounds=50`, `min_improvement=1e-6`, `verbose_eval=10` |

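The OBSERVER detection default `number_of_adjacent_windows_to_detect_leak=5` suggests that a leak is declared only after several consecutive windows exceed the flow threshold. The sketch below shows one plausible reading of that rule; it is an assumption based on the key name, not a description of the actual script, and `detect_leak` is a hypothetical function.

```python
def detect_leak(predicted_flows, threshold, adjacent_windows=5):
    """Return the index of the window that completes the first qualifying run, or None."""
    run = 0
    for i, flow in enumerate(predicted_flows):
        # Count consecutive windows at or above the threshold; reset otherwise.
        run = run + 1 if flow >= threshold else 0
        if run >= adjacent_windows:
            return i
    return None


flows = [0.0, 0.3, 0.4, 0.1, 0.5, 0.6, 0.7, 0.8, 0.9]
print(detect_leak(flows, threshold=0.25, adjacent_windows=5))
```

Requiring a run of adjacent windows trades detection latency for fewer false alarms from isolated spikes in the predicted flow.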
Default robustness blocks

All scripts merge a nested robustness block when it is omitted. Shared nested defaults are:

- `deadband_abs.min = 1e-4`
- `deadband_abs.max = 5e-1`
- `quantize_decimals.min = 1`
- `quantize_decimals.max = 4`
- `noise_std_abs.min = 0.0`
- `noise_std_abs.max = 0.0`
- `feature_dropout_prob.min = 0.0`
- `feature_dropout_prob.max = 0.1`
- `random_state = 42`

Per script, the top-level robustness defaults are:

| Config section | `enabled` | `copies` | `noise_std_frac` |
|---|---|---|---|
| `training_pfm_detection_pipeline` | `true` | 3 | min=0.001, max=0.03 |
| `training_pfm_size_pipeline` | `false` | 0 | min=0.001, max=0.02 |
| `training_pfm_location_pipeline` | `true` | 3 | min=0.001, max=0.02 |
| `training_pfm_leakflow_pipeline` | `true` | 3 | min=0.001, max=0.02 |
| `training_observer_detection_pipeline` | `true` | 3 | min=0.001, max=0.02 |
| `training_observer_size_pipeline` | `true` | 3 | min=0.001, max=0.02 |
| `training_observer_location_pipeline` | `true` | 3 | min=0.001, max=0.02 |

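To make the `copies`, `noise_std_frac`, and `quantize_decimals` settings concrete, here is a minimal augmentation sketch. It assumes that each copy draws one noise fraction and one decimal count uniformly from the configured `min`/`max` range, that noise is Gaussian with a standard deviation proportional to the feature's magnitude, and that quantization is plain rounding. None of this is confirmed by the source, and the `augment` function is hypothetical; deadband and feature dropout are deliberately left out.

```python
import random


def augment(row, copies, noise_std_frac, quantize_decimals, seed=42):
    """Return `copies` perturbed versions of `row` (a list of floats)."""
    rng = random.Random(seed)
    out = []
    for _ in range(copies):
        # One noise level and one precision per copy (assumed behaviour).
        frac = rng.uniform(noise_std_frac["min"], noise_std_frac["max"])
        decimals = rng.randint(quantize_decimals["min"], quantize_decimals["max"])
        noisy = [v + rng.gauss(0.0, frac * abs(v)) for v in row]
        out.append([round(v, decimals) for v in noisy])
    return out


augmented = augment(
    [1.234567, -20.5],
    copies=3,
    noise_std_frac={"min": 0.001, "max": 0.02},
    quantize_decimals={"min": 1, "max": 4},
)
print(len(augmented))
```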
Default lgbm_params families
If lgbm_params is omitted, each script injects its own built-in optimized block from
DEFAULT_LGBM_PARAMS. At a high level:

| Config section | `objective` | `metric` |
|---|---|---|
| `training_pfm_detection_pipeline` | `binary` | `["binary_logloss", "auc"]` |
| `training_pfm_size_pipeline` | `multiclass` | `["multi_logloss", "multi_error"]` |
| `training_pfm_location_pipeline` | `multiclass` | `["multi_error", "multi_logloss"]` |
| `training_pfm_leakflow_pipeline` | `regression` | `["l1", "rmse"]` |
| `training_observer_detection_pipeline` | `regression` | `["l1", "rmse"]` |
| `training_observer_size_pipeline` | `regression` | `["l1", "rmse"]` |
| `training_observer_location_pipeline` | `regression` | `["l1", "rmse"]` |

The exact numeric defaults for learning_rate, num_leaves, max_depth, regularization,
bagging, and related LightGBM knobs are defined directly in each training script. If you omit
lgbm_params, the whole optimized block from that script is applied.
Idempotency

- Each script computes a config hash from relevant keys (including `input_path`, `output_folder`, `features_schema_file`).
- If `output_folder/training_metadata.json` exists and its `config_hash` matches, the script exits without training and prints an idempotent message.
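The idempotency check can be sketched as hashing a canonical form of the relevant keys and comparing it with the stored `config_hash`. The choice of SHA-256 over sorted JSON, the exact key set in `HASHED_KEYS`, and the function names are assumptions for illustration; the scripts may hash more keys or use a different scheme. The sketch also handles only local paths.

```python
import hashlib
import json
from pathlib import Path

# Assumed subset; the source says the hash "includes" these keys.
HASHED_KEYS = ("input_path", "output_folder", "features_schema_file")


def config_hash(cfg, keys=HASHED_KEYS):
    """Hash the selected config keys in a stable, order-independent way."""
    relevant = {k: cfg.get(k) for k in keys}
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def already_trained(cfg):
    """True if training_metadata.json exists and records the same config hash."""
    meta_path = Path(cfg["output_folder"]) / "training_metadata.json"
    if not meta_path.exists():
        return False
    try:
        stored = json.loads(meta_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable metadata is treated as "not trained"
    return isinstance(stored, dict) and stored.get("config_hash") == config_hash(cfg)
```

A caller would check `already_trained(cfg)` before training and write the new hash into `training_metadata.json` afterwards.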
Outputs

- `model_lgbm.txt`: LightGBM model.
- `metrics.json`: Validation (and optionally train) metrics.
- `training_metadata.json`: Config hash, paths, timestamp.
- Detection pipelines: `detection_metrics.json`, `confusion_matrix.json`, and optional deployment bundle (e.g. `lgbm_inference_config.json`, `features_schema.json`).
- Multiclass: Label mapping and optional confusion matrix.
S3

- Full S3 support: `input_path`, `output_folder`, and `features_schema_file` can be S3 paths. The trainer uses `mlops.storage` for listing, reading, and writing. Install `s3fs` (`pip install .[s3]`).