Pipelines Overview

All pipelines are driven by `configs/pipelines_config.yml`. Each pipeline has its own top-level section (e.g. `tpl_genkey_pipeline`, `features_pipeline`). Each script loads its section and optionally accepts CLI overrides.
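The load-then-override pattern described above can be sketched as follows. This is a minimal illustration, not the scripts' actual code: the function names and the `--source-folder` / `--output-folder` flags are hypothetical, and the dictionary stands in for the parsed YAML file.

```python
from __future__ import annotations

import argparse
from copy import deepcopy


def load_pipeline_section(config: dict, section: str, overrides: dict | None = None) -> dict:
    """Return a copy of one top-level section with non-None overrides applied."""
    if section not in config:
        raise KeyError(f"No section named {section!r} in config")
    merged = deepcopy(config[section])
    for key, value in (overrides or {}).items():
        # A flag that was not passed on the command line stays None and is ignored.
        if value is not None:
            merged[key] = value
    return merged


def parse_overrides(argv: list[str]) -> dict:
    """Parse hypothetical CLI flags; the real scripts may expose different ones."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--source-folder", dest="source_folder", default=None)
    parser.add_argument("--output-folder", dest="output_folder", default=None)
    return vars(parser.parse_args(argv))


# Stand-in for the parsed contents of configs/pipelines_config.yml (values are illustrative).
config = {
    "features_pipeline": {
        "source_folder": "data/windows",
        "output_folder": "data/features",
    },
}

section = load_pipeline_section(
    config,
    "features_pipeline",
    parse_overrides(["--output-folder", "tmp/features"]),
)
```

Copying the section before merging keeps the loaded config untouched, so the same config object can be reused across several pipeline runs.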

Pipeline list

Pipeline	Script	Purpose
TPL/GENKEY	`run_tpl_genkey_pipeline.py`	Convert OLGA `.tpl`/`.genkey` to Parquet.
Windows	`run_windows_pipeline.py`	Slice time-series Parquet into fixed-size windows.
Features	`run_features_pipeline.py`	Extract wavelet (and raw) features from windows.
Parquet → CSV	`run_parquet_csv_pipeline.py`	Export Parquet to CSV; optional leak filtering.
Feature selection	`run_feature_selection_pipeline.py`	Select top-K features by LightGBM importance.
Training (PFM)	`run_training_pfm_*_pipeline.py`	Detection, size, location, leak flow (LightGBM).
Training (OBSERVER)	`run_training_observer_*_pipeline.py`	Detection, size, location (LightGBM).
Hyperparameter optimization	`run_optimize_lgbm_hyperparameters.py`	Optuna tuning for detection/multiclass/regression.
Test offline	`run_test_offline_pipeline.py`, `run_pfm_test_offline_pipeline.py`	Evaluate models on Parquet/CSV; Excel + plots.
LDS Bayesian	`run_test_offline_pipeline.py`, `run_lds_bayesian_optimizacion.py`	Bayesian fusion of detectors and Optuna-based calibration.

Common config concepts

Paths and storage

source_folder — Input directory (raw data, windows, etc.).
output_folder — Output directory for Parquet, models, or artifacts.
input_path — Single file or directory (used by training, feature selection, Optuna, test offline).

For S3:

Use explicit s3://bucket/key paths, or
Set storage.type: s3, storage.bucket, and optional storage.prefix; then relative paths are resolved under that bucket/prefix.

All path keys are resolved at runtime via resolve_config_paths(config). See Tech Stack and Development Conventions.
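To make the two S3 options concrete, here is a rough sketch of how relative paths could be mapped under a bucket and prefix. It is not the project's `resolve_config_paths` implementation: the helper names, the set of path keys, and the assumption that the `storage` block sits in the same dictionary as the path keys are all illustrative.

```python
import posixpath

# Keys treated as paths in this sketch; the real resolver may handle more.
PATH_KEYS = ("source_folder", "output_folder", "input_path")


def resolve_path(path: str, storage: dict) -> str:
    """Leave explicit s3:// URIs alone; place relative paths under bucket/prefix."""
    if path.startswith("s3://"):
        return path
    if storage.get("type") != "s3":
        return path  # local storage: nothing to rewrite
    prefix = storage.get("prefix", "").strip("/")
    # Assumption: a leading slash is dropped so the path becomes an object key.
    key = path.lstrip("/")
    if prefix:
        key = posixpath.join(prefix, key)
    return f"s3://{storage['bucket']}/{key}"


def resolve_paths_sketch(config: dict) -> dict:
    """Return a new dict with every known path key resolved."""
    storage = config.get("storage", {})
    return {
        key: resolve_path(value, storage)
        if key in PATH_KEYS and isinstance(value, str)
        else value
        for key, value in config.items()
    }


resolved = resolve_paths_sketch(
    {
        "storage": {"type": "s3", "bucket": "my-bucket", "prefix": "runs/v1"},
        "source_folder": "data/windows",
        "output_folder": "s3://other-bucket/features",
    }
)
```

In this example `source_folder` ends up under `my-bucket/runs/v1`, while the explicit `s3://other-bucket/features` path is passed through unchanged.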

Idempotency

ETL (TPL/GENKEY, Windows): Skip files that already have output in output_folder.
Features: Skip if features_metadata.json exists with same config hash.
Parquet → CSV: Skip a CSV that already exists when overwrite: false; if the config changes, the new config hash can force a re-run.
Feature selection, training, Optuna, test offline: Skip if metadata file exists with same config hash.

To force re-execution, change the config or remove the output/metadata.
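The hash-based skip rule can be sketched like this. It is a minimal illustration under stated assumptions: the `config_hash` field name inside the metadata file, the SHA-256 over sorted JSON, and the sample config keys (`window_size`, `wavelet`) are hypothetical, and the pipelines may compute and store the hash differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(config: dict) -> str:
    """Hash a config deterministically: sorted keys give a stable serialization."""
    payload = json.dumps(config, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def should_skip(metadata_path: Path, config: dict) -> bool:
    """Skip only when metadata exists and records the same config hash."""
    if not metadata_path.exists():
        return False
    try:
        stored = json.loads(metadata_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable metadata: safer to re-run
    return isinstance(stored, dict) and stored.get("config_hash") == config_hash(config)


def write_metadata(metadata_path: Path, config: dict) -> None:
    """Record the hash after a successful run."""
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(
        json.dumps({"config_hash": config_hash(config)}, indent=2),
        encoding="utf-8",
    )


with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "features_metadata.json"
    cfg = {"window_size": 256, "wavelet": "db4"}  # illustrative keys only

    first_run_skipped = should_skip(meta, cfg)  # no metadata yet
    write_metadata(meta, cfg)
    second_run_skipped = should_skip(meta, cfg)  # same config
    changed_run_skipped = should_skip(meta, {**cfg, "window_size": 512})  # config changed
```

Both ways of forcing re-execution fall out of this check: deleting the metadata file makes `should_skip` return `False`, and so does any change to the config.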

Detailed pipeline docs

TPL/GENKEY (OLGA) — Raw OLGA to Parquet.
Windows — Time-series to fixed windows.
Features — Wavelet feature extraction.
Parquet to CSV — Export and optional filtering.
Feature selection — Top-K by importance.
Training (PFM & OBSERVER) — All training pipelines and config.
Hyperparameter optimization — Optuna.
Test offline — Offline evaluation and reports.
LDS Bayesian — Bayesian fusion and optimization.