
Pipelines Overview

All pipelines are driven by configs/pipelines_config.yml. Each pipeline has a top-level section (e.g. tpl_genkey_pipeline, features_pipeline). Scripts load that section and optionally accept CLI overrides.
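As a rough illustration of the "load a section, then apply CLI overrides" pattern, the sketch below merges command-line flags onto one pipeline section. It is not the project's actual loader: `PARSED_CONFIG` stands in for the parsed contents of configs/pipelines_config.yml, and `load_section` and the `--source-folder` / `--output-folder` flags are hypothetical names chosen for the example.

```python
import argparse

# Stand-in for the parsed contents of configs/pipelines_config.yml.
# The values inside the section are illustrative assumptions.
PARSED_CONFIG = {
    "features_pipeline": {
        "source_folder": "data/windows",
        "output_folder": "data/features",
    },
}


def load_section(config, section, argv=None):
    """Return a copy of one pipeline section with CLI overrides applied."""
    if section not in config:
        raise KeyError(f"missing top-level section: {section}")
    settings = dict(config[section])  # copy so the loaded config stays intact

    parser = argparse.ArgumentParser()
    # Hypothetical flags; the real scripts may expose different ones.
    parser.add_argument("--source-folder")
    parser.add_argument("--output-folder")
    args = parser.parse_args(argv)

    # Only flags that were actually passed replace the config values.
    for key, value in vars(args).items():
        if value is not None:
            settings[key] = value
    return settings


settings = load_section(
    PARSED_CONFIG, "features_pipeline", ["--output-folder", "out/features"]
)
```

Copying the section before merging keeps the loaded config untouched, so the same parsed file can serve several pipelines in one process.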


Pipeline list

| Pipeline | Script | Purpose |
|---|---|---|
| TPL/GENKEY | run_tpl_genkey_pipeline.py | Convert OLGA .tpl/.genkey to Parquet. |
| Windows | run_windows_pipeline.py | Slice time-series Parquet into fixed-size windows. |
| Features | run_features_pipeline.py | Extract wavelet (and raw) features from windows. |
| Parquet → CSV | run_parquet_csv_pipeline.py | Export Parquet to CSV; optional leak filtering. |
| Feature selection | run_feature_selection_pipeline.py | Select top-K features by LightGBM importance. |
| Training (PFM) | run_training_pfm_*_pipeline.py | Detection, size, location, leak flow (LightGBM). |
| Training (OBSERVER) | run_training_observer_*_pipeline.py | Detection, size, location (LightGBM). |
| Hyperparameter optimization | run_optimize_lgbm_hyperparameters.py | Optuna tuning for detection/multiclass/regression. |
| Test offline | run_test_offline_pipeline.py, run_pfm_test_offline_pipeline.py | Evaluate models on Parquet/CSV; Excel + plots. |
| LDS Bayesian | run_test_offline_pipeline.py, run_lds_bayesian_optimizacion.py | Bayesian fusion of detectors and Optuna-based calibration. |

Common config concepts

Paths and storage

  • source_folder — Input directory (raw data, windows, etc.).
  • output_folder — Output directory for Parquet, models, or artifacts.
  • input_path — Single file or directory (used by training, feature selection, Optuna, test offline).

For S3:

  • Use explicit s3://bucket/key paths, or
  • Set storage.type: s3, storage.bucket, and optional storage.prefix; then relative paths are resolved under that bucket/prefix.

All path keys are resolved at runtime via resolve_config_paths(config). See Tech Stack and Development Conventions.
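The resolution rules above can be sketched as follows. This is a minimal illustration of the described behavior, not the implementation of `resolve_config_paths`: the helper names `resolve_path` and `resolve_section`, the `PATH_KEYS` tuple, and the exact slash handling are assumptions made for the example.

```python
# Path keys named in this section; the real resolver may cover more.
PATH_KEYS = ("source_folder", "output_folder", "input_path")


def resolve_path(path, storage):
    """Resolve one path value against the storage settings."""
    if path.startswith("s3://"):
        return path  # explicit S3 URIs are used as-is
    if storage.get("type") != "s3":
        return path  # non-S3 storage: leave the path unchanged
    parts = [storage["bucket"], storage.get("prefix", ""), path]
    # Drop empty segments and stray slashes so the key has no "//".
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    return "s3://" + "/".join(cleaned)


def resolve_section(section, storage):
    """Return a copy of a pipeline section with its path keys resolved."""
    resolved = dict(section)
    for key in PATH_KEYS:
        if key in resolved:
            resolved[key] = resolve_path(resolved[key], storage)
    return resolved


storage = {"type": "s3", "bucket": "my-bucket", "prefix": "leaks/"}
section = {"source_folder": "data/windows", "output_folder": "s3://other/out"}
resolved = resolve_section(section, storage)
```

Here the relative `source_folder` lands under the configured bucket and prefix, while the explicit `s3://` value passes through untouched.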

Idempotency

  • ETL (TPL/GENKEY, Windows): Skip files that already have output in output_folder.
  • Features: Skip if features_metadata.json exists with same config hash.
  • Parquet → CSV: Skip a CSV that already exists when overwrite: false; a change in the config hash can still force a re-run.
  • Feature selection, training, Optuna, test offline: Skip if metadata file exists with same config hash.

To force re-execution, change the config or remove the output/metadata.
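The config-hash check can be pictured roughly as below. This is a sketch under stated assumptions, not the project's code: the `config_hash` metadata key, the SHA-256 over canonical JSON, and the helper names `should_skip` and `write_metadata` are all hypothetical; only the file name features_metadata.json comes from the text above.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(section):
    """Hash a config section; sorted keys make the result order-independent."""
    canonical = json.dumps(section, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_skip(metadata_path, section):
    """Return True when stored metadata carries the same config hash."""
    path = Path(metadata_path)
    if not path.exists():
        return False  # no previous run recorded
    try:
        stored = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable metadata: safer to re-run
    if not isinstance(stored, dict):
        return False
    # "config_hash" is an assumed key name for this sketch.
    return stored.get("config_hash") == config_hash(section)


def write_metadata(metadata_path, section):
    """Record the hash of the config that produced the current outputs."""
    payload = {"config_hash": config_hash(section)}
    Path(metadata_path).write_text(json.dumps(payload), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "features_metadata.json"
    section = {"source_folder": "data/windows", "wavelet": "db4"}
    first = should_skip(meta, section)       # nothing written yet
    write_metadata(meta, section)
    second = should_skip(meta, section)      # same config as recorded
    changed = should_skip(meta, {**section, "wavelet": "db8"})
```

Under these assumptions, deleting the metadata file or editing any value in the section both make `should_skip` return False, which matches the two ways of forcing a re-run described above.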


Detailed pipeline docs