Pipelines Overview
All pipelines are driven by configs/pipelines_config.yml. Each pipeline has a top-level section (e.g. tpl_genkey_pipeline, features_pipeline). Scripts load that section and optionally accept CLI overrides.
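The load-then-override pattern described above can be sketched roughly as follows. This is a minimal illustration, not the scripts' actual code: the function name `apply_cli_overrides` and the two CLI flags are hypothetical, and the `config` dict stands in for whatever a YAML loader would return for configs/pipelines_config.yml.

```python
import argparse
from typing import Any


def apply_cli_overrides(section: dict[str, Any], argv: list[str]) -> dict[str, Any]:
    """Return a copy of a pipeline config section with CLI overrides applied."""
    parser = argparse.ArgumentParser()
    # Hypothetical flags; the real scripts may expose different ones.
    parser.add_argument("--source-folder", dest="source_folder", default=None)
    parser.add_argument("--output-folder", dest="output_folder", default=None)
    args = parser.parse_args(argv)

    merged = dict(section)  # shallow copy so the loaded config is not mutated
    for key, value in vars(args).items():
        if value is not None:  # only flags that were actually passed override the file
            merged[key] = value
    return merged


# Stand-in for the dict parsed from configs/pipelines_config.yml (values are illustrative).
config = {
    "features_pipeline": {
        "source_folder": "data/windows",
        "output_folder": "data/features",
    }
}

section = apply_cli_overrides(
    config["features_pipeline"], ["--output-folder", "tmp/features"]
)
print(section)
```

Keeping the override step as a pure function that returns a new dict makes it easy to log both the file values and the effective values for a run.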
Pipeline list
| Pipeline | Script | Purpose |
|---|---|---|
| TPL/GENKEY | run_tpl_genkey_pipeline.py | Convert OLGA .tpl/.genkey to Parquet. |
| Windows | run_windows_pipeline.py | Slice time-series Parquet into fixed-size windows. |
| Features | run_features_pipeline.py | Extract wavelet (and raw) features from windows. |
| Parquet → CSV | run_parquet_csv_pipeline.py | Export Parquet to CSV; optional leak filtering. |
| Feature selection | run_feature_selection_pipeline.py | Select top-K features by LightGBM importance. |
| Training (PFM) | run_training_pfm_*_pipeline.py | Detection, size, location, leak flow (LightGBM). |
| Training (OBSERVER) | run_training_observer_*_pipeline.py | Detection, size, location (LightGBM). |
| Hyperparameter optimization | run_optimize_lgbm_hyperparameters.py | Optuna tuning for detection/multiclass/regression. |
| Test offline | run_test_offline_pipeline.py, run_pfm_test_offline_pipeline.py | Evaluate models on Parquet/CSV; Excel + plots. |
| LDS Bayesian | run_test_offline_pipeline.py, run_lds_bayesian_optimizacion.py | Bayesian fusion of detectors and Optuna-based calibration. |
Common config concepts
Paths and storage
- source_folder: Input directory (raw data, windows, etc.).
- output_folder: Output directory for Parquet, models, or artifacts.
- input_path: Single file or directory (used by training, feature selection, Optuna, test offline).
For S3:
- Use explicit s3://bucket/key paths, or
- Set storage.type: s3, storage.bucket, and optional storage.prefix; relative paths are then resolved under that bucket/prefix.
All path keys are resolved at runtime via resolve_config_paths(config). See Tech Stack and Development Conventions.
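To make the S3 resolution rule concrete, here is a rough sketch of how a single path value could be resolved against the storage section. It is not the project's resolve_config_paths implementation: the function name `resolve_path_sketch`, the bucket name, and the handling of local storage are assumptions for illustration only.

```python
from typing import Any


def resolve_path_sketch(path: str, storage: dict[str, Any]) -> str:
    """Resolve one path value against an optional S3 storage section."""
    if path.startswith("s3://"):
        return path  # explicit S3 URIs are used as written
    if storage.get("type") != "s3":
        return path  # assumption: non-S3 storage leaves the path unchanged
    prefix = str(storage.get("prefix") or "").strip("/")
    # Assumes `path` is relative; a leading slash is dropped before joining.
    parts = [storage["bucket"], prefix, path.lstrip("/")]
    return "s3://" + "/".join(part for part in parts if part)


# Hypothetical storage section with the three keys named above.
storage = {"type": "s3", "bucket": "example-bucket", "prefix": "leaks/v1"}

with_prefix = resolve_path_sketch("data/windows", storage)
without_prefix = resolve_path_sketch("data/windows", {"type": "s3", "bucket": "example-bucket"})
explicit = resolve_path_sketch("s3://other-bucket/key", storage)
print(with_prefix)     # s3://example-bucket/leaks/v1/data/windows
print(without_prefix)  # s3://example-bucket/data/windows
print(explicit)        # s3://other-bucket/key
```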
Idempotency
- ETL (TPL/GENKEY, Windows): Skip files that already have output in output_folder.
- Features: Skip if features_metadata.json exists with the same config hash.
- Parquet–CSV: Skip CSV if it exists and overwrite: false; config hash can force re-run when config changes.
- Feature selection, training, Optuna, test offline: Skip if metadata file exists with the same config hash.
To force re-execution, change the config or remove the output/metadata.
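The config-hash check can be sketched as below. This is an assumption-laden illustration rather than the pipelines' real logic: the hashing scheme (SHA-256 over canonical JSON), the `config_hash` metadata key, the helper names, and the `wavelet`/`level` config keys are all hypothetical; only the features_metadata.json file name comes from the description above.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any


def config_hash(section: dict[str, Any]) -> str:
    """Hash a config section; sorted keys make the digest independent of key order."""
    canonical = json.dumps(section, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_skip(metadata_path: Path, section: dict[str, Any]) -> bool:
    """Skip only if metadata exists and was written for the same config."""
    if not metadata_path.exists():
        return False
    stored = json.loads(metadata_path.read_text(encoding="utf-8"))
    return stored.get("config_hash") == config_hash(section)


def write_metadata(metadata_path: Path, section: dict[str, Any]) -> None:
    """Record the hash of the config that produced the current outputs."""
    payload = {"config_hash": config_hash(section)}
    metadata_path.write_text(json.dumps(payload), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "features_metadata.json"
    section = {"wavelet": "db4", "level": 3}  # hypothetical config keys

    first_run = should_skip(meta, section)  # no metadata yet, so the step runs
    write_metadata(meta, section)
    same_config = should_skip(meta, section)  # unchanged config, so the step is skipped
    changed_config = should_skip(meta, {**section, "level": 4})  # edited config, so it reruns

print(first_run, same_config, changed_config)  # False True False
```

Deleting the metadata file has the same effect as changing the config here, which matches the two ways of forcing re-execution mentioned above.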
Detailed pipeline docs
- TPL/GENKEY (OLGA) — Raw OLGA to Parquet.
- Windows — Time-series to fixed windows.
- Features — Wavelet feature extraction.
- Parquet to CSV — Export and optional filtering.
- Feature selection — Top-K by importance.
- Training (PFM & OBSERVER) — All training pipelines and config.
- Hyperparameter optimization — Optuna.
- Test offline — Offline evaluation and reports.
- LDS Bayesian — Bayesian fusion and optimization.