Frameworks & Libraries

This page lists the main frameworks and libraries used across the platform and describes the role each one plays.


ETL & Data

pandas

  • Role: Primary data structure for tabular data in ETL, features, and training.
  • Usage: Reading/writing Parquet and CSV, windowing, joins, and feature alignment. All pipeline scripts rely on DataFrames for in-memory processing.

PyArrow

  • Role: Fast Parquet I/O and columnar representation.
  • Usage: pandas.read_parquet(..., engine="pyarrow") and df.to_parquet(..., engine="pyarrow"). Row groups and compression are configurable in the features and parquet–CSV pipelines.

s3fs

  • Role: S3-backed filesystem compatible with fsspec.
  • Usage: Used by mlops.storage to resolve s3:// paths, list objects, and read/write files when the config uses S3 (explicit URIs or storage.type: s3). Optional dependency: pip install ".[s3]" (quoted so that shells such as zsh do not expand the brackets).
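One way an optional S3 backend can be wired in is to import s3fs only when an s3:// URI is actually seen. The sketch below is not the mlops.storage implementation; open_uri is a hypothetical helper, and the S3 branch assumes credentials are available through the usual AWS mechanisms.

```python
import os
import tempfile


def open_uri(path: str, mode: str = "rb"):
    """Open a local path or an s3:// URI (hypothetical helper)."""
    if path.startswith("s3://"):
        # Imported lazily so that local-only installs do not need s3fs.
        import s3fs

        return s3fs.S3FileSystem().open(path, mode)
    return open(path, mode)


# Local usage; an s3:// URI would take the other branch.
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "sample.bin")
    with open(local_path, "wb") as fh:
        fh.write(b"abc")
    with open_uri(local_path) as fh:
        payload = fh.read()
```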

PyWavelets (PyWT)

  • Role: Wavelet transforms for feature extraction.
  • Usage: In the features pipeline, wavelet coefficients (e.g. db2, configurable level) are computed per window and combined into a fixed-length feature vector per row.
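The per-window wavelet step can be sketched roughly as follows. The window length, decomposition level, and the choice to concatenate every coefficient band are assumptions for illustration; the actual pipeline may aggregate coefficients differently.

```python
import numpy as np
import pywt


def wavelet_features(window: np.ndarray, wavelet: str = "db2", level: int = 2) -> np.ndarray:
    """Concatenate all decomposition coefficients of one window."""
    # wavedec returns [cA_level, cD_level, ..., cD_1].
    coeffs = pywt.wavedec(window, wavelet, level=level)
    return np.concatenate(coeffs)


# Synthetic signal split into four non-overlapping windows of 64 samples.
signal = np.sin(np.linspace(0.0, 8.0 * np.pi, 256))
windows = signal.reshape(4, 64)
features = np.vstack([wavelet_features(w) for w in windows])
```

Because every window has the same length and uses the same wavelet and level, each row of features has the same number of columns, which is what makes the vector fixed-length.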

NumPy / SciPy

  • Role: Numerical operations and signal/statistical helpers.
  • Usage: Array ops in transforms, discretization (e.g. leak size clusters), and metric computation.

Machine Learning

LightGBM

  • Role: Gradient boosting for classification and regression.
  • Usage: All training pipelines (PFM detection/size/location/leak flow, OBSERVER detection/size/location), feature selection (importance-based top-K), and offline evaluation (inference). Training uses configurable lgbm_params, early stopping, and validation splits (by row or by case).

scikit-learn

  • Role: Splits, metrics, and preprocessing.
  • Usage: train_test_split, GroupShuffleSplit (case-based splits), accuracy/precision/recall/F1/ROC-AUC, scaling (optional), and clustering for multiclass labels.
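A case-based split keeps every row of a given case on one side of the train/validation boundary, which avoids leakage between rows of the same case. A small sketch with made-up case IDs:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Ten hypothetical cases with five rows each.
case_ids = np.repeat(np.arange(10), 5)
X = np.arange(case_ids.size).reshape(-1, 1)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(X, groups=case_ids))

# No case appears on both sides of the split.
train_cases = set(case_ids[train_idx])
valid_cases = set(case_ids[valid_idx])
```

With GroupShuffleSplit, test_size refers to the share of groups (cases), not the share of rows.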

Optuna

  • Role: Hyperparameter optimization.
  • Usage: run_optimize_lgbm_hyperparameters.py runs Optuna studies for detection, multiclass, or regression objectives; it supports pruning, a timeout, and an optional feature-selection step before the search.

joblib

  • Role: Serialization of models and large objects.
  • Usage: Saving/loading LightGBM models and artifacts in training and deployment bundles.
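Saving and restoring an artifact with joblib follows the pattern below. The bundle contents and file name are hypothetical; the deployment bundles in the project may hold different objects.

```python
import tempfile
from pathlib import Path

import joblib

# Hypothetical artifact; a real bundle would typically include a trained model.
bundle = {"model_name": "detector", "feature_names": ["f0", "f1"], "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "bundle.joblib"
    joblib.dump(bundle, artifact_path, compress=3)
    restored = joblib.load(artifact_path)
```

joblib files are pickle-based, so they should only be loaded from trusted sources.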

Orchestration & Scheduling

APScheduler

  • Role: In-process job scheduling.
  • Usage: etl_scheduler.py uses AsyncIOScheduler with interval, cron, or daily triggers to run the TPL/GENKEY pipeline (or other configured pipeline) on a schedule. Logging, stats, and graceful shutdown are built on top.

Prefect 3

  • Role: Flow orchestration and production deployment.
  • Usage: Flows wrap pipeline execution (e.g. genkey_pipeline_flow); deployments and schedules are defined in prefect.yaml and managed via deploy_flows.py / prefect_manage.sh. Progress and logs can be sent to Prefect via PrefectProgressMonitor.

PyYAML

  • Role: Configuration loading.
  • Usage: All YAML configs (pipelines_config.yml, ETL scheduler configs, Prefect-related config) are loaded with yaml.safe_load.
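For reference, loading a config with yaml.safe_load looks like this. Apart from storage.type, the keys below are invented for the example and do not reflect the real pipelines_config.yml schema.

```python
import yaml

# Inline document standing in for a config file; the "features" keys are hypothetical.
config_text = """
storage:
  type: s3
features:
  wavelet: db2
  level: 2
"""

config = yaml.safe_load(config_text)
storage_type = config["storage"]["type"]
wavelet_level = config["features"]["level"]
```

safe_load only constructs plain Python types (dicts, lists, strings, numbers), so a config file cannot instantiate arbitrary objects.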

Reporting & Export

openpyxl

  • Role: Excel file generation.
  • Usage: Offline test pipelines write Excel reports (e.g. offline_report.xlsx) with multiple sheets and optional styling.
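A multi-sheet report with a styled header row can be produced along these lines. The sheet names and row contents are hypothetical, and write_report is an illustrative helper rather than the project's reporting code.

```python
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font


def write_report(path: Path, sheets: dict) -> None:
    """Write one worksheet per entry; the first row of each is bolded."""
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for name, rows in sheets.items():
        ws = wb.create_sheet(title=name)
        for row in rows:
            ws.append(row)
        for cell in ws[1]:  # header row
            cell.font = Font(bold=True)
    wb.save(str(path))


with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "offline_report.xlsx"
    write_report(
        report_path,
        {
            "summary": [["metric", "value"], ["accuracy", 0.93]],
            "per_case": [["case_id", "detected"], [1, True], [2, False]],
        },
    )
    sheet_names = load_workbook(str(report_path)).sheetnames
```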

Matplotlib

  • Role: Plotting.
  • Usage: Training curves, confusion matrices, diagnostic plots, and leak series in offline evaluation.

Concurrency & System

concurrent.futures

  • Role: Parallel execution.
  • Usage: ProcessPoolExecutor in the features pipeline (parallel file processing); ThreadPoolExecutor in loaders for I/O.
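The two executor types share one interface, so the choice between threads and processes can be a single switch. In this sketch, run_parallel and file_size_hint are hypothetical names, and the per-file work is a trivial stand-in.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def file_size_hint(name: str) -> int:
    """Stand-in for per-file work; a real pipeline would compute features."""
    return len(name)


def run_parallel(func, items, use_processes=False, max_workers=2):
    """Apply func to every item in a thread pool or a process pool."""
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(func, items))


files = ["a.parquet", "bb.parquet", "ccc.parquet"]
sizes = run_parallel(file_size_hint, files)
```

Threads suit I/O-bound loaders, while processes suit CPU-bound feature computation. When use_processes=True is used from a script, the call should sit under an if __name__ == "__main__": guard, and func must be defined at module level so it can be pickled.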

psutil

  • Role: System and process information.
  • Usage: Optional. When installed, it is used for resource monitoring and for sizing worker pools.
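Worker sizing with an optional psutil dependency might be approached as below. The heuristic (physical cores, capped by available memory and a hard limit) and the name suggest_workers are assumptions for illustration, not the project's actual rule.

```python
import os

try:
    import psutil
except ImportError:  # psutil is optional
    psutil = None


def suggest_workers(mem_per_worker_gb: float = 1.0, cap: int = 8) -> int:
    """Pick a worker count bounded by cores, free memory, and a hard cap."""
    if psutil is not None:
        cpus = psutil.cpu_count(logical=False) or os.cpu_count() or 1
        available_gb = psutil.virtual_memory().available / 1024**3
        by_memory = max(1, int(available_gb // mem_per_worker_gb))
    else:
        # Without psutil there is no memory estimate, so only cores and cap apply.
        cpus = os.cpu_count() or 1
        by_memory = cap
    return max(1, min(cpus, by_memory, cap))


workers = suggest_workers()
```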

tzlocal / pytz

  • Role: Timezone handling.
  • Usage: Scheduler and Prefect schedules use configurable timezones for cron and daily runs.
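When a daily run time is given in a configured timezone, converting it to UTC with pytz looks like the following sketch. The zone and the run time are arbitrary examples.

```python
from datetime import datetime

import pytz

# Example zone; in the platform this would come from configuration.
tz = pytz.timezone("Europe/Berlin")

# localize() attaches the zone with the correct offset for that date.
local_run = tz.localize(datetime(2024, 7, 1, 2, 30))
utc_run = local_run.astimezone(pytz.utc)
```

With pytz, passing the zone as tzinfo= to the datetime constructor can yield a historical local-mean-time offset, so localize() is the safer way to build an aware datetime.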

Version Bounds

Exact version bounds are defined in pyproject.toml under [project.optional-dependencies] (e.g. etl, ml, s3, docs). Install with:

pip install -e ".[etl,ml,s3,docs]"