Leak Detection MLOps Platform Documentation

Frameworks & Libraries

This page lists the main frameworks and libraries used across the platform and describes how each one is used.

ETL & Data

pandas

Role: Primary data structure for tabular data in ETL, features, and training.
Usage: Reading/writing Parquet and CSV, windowing, joins, and feature alignment. All pipeline scripts rely on DataFrames for in-memory processing.
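
As an illustration of the windowing, join, and alignment steps mentioned above, here is a minimal sketch. The column names (timestamp, pressure, leak) and the values are hypothetical and are not taken from the platform's pipelines.

```python
import pandas as pd

# Hypothetical sensor readings: one row per timestamp.
readings = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=6, freq="1min"),
        "pressure": [5.0, 5.1, 4.9, 4.2, 4.1, 4.0],
    }
)

# Hypothetical labels that only exist for the last four timestamps.
labels = pd.DataFrame(
    {
        "timestamp": readings["timestamp"].iloc[2:].to_numpy(),
        "leak": [0, 1, 1, 1],
    }
)

# Windowing: rolling mean over the last three samples.
readings["pressure_mean_3"] = readings["pressure"].rolling(window=3).mean()

# Join and alignment: keep only rows that have a full window and a label.
aligned = readings.merge(labels, on="timestamp", how="inner").dropna()
```

The inner join drops rows without a label, and dropna removes rows whose rolling window is still incomplete.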

PyArrow

Role: Fast Parquet I/O and columnar representation.
Usage: Accessed through pandas with pandas.read_parquet(..., engine="pyarrow") and df.to_parquet(..., engine="pyarrow"). Row-group size and compression are configurable in the features and Parquet–CSV pipelines.

s3fs

Role: S3-backed filesystem compatible with fsspec.
Usage: Used by mlops.storage to resolve s3:// paths, list objects, and read/write files when the config uses S3 (explicit URIs or storage.type: s3). Optional dependency: pip install ".[s3]".
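
The two ways of selecting S3 described above (an explicit s3:// URI, or storage.type: s3 in the config) can be sketched as a small decision helper. This is not the mlops.storage implementation; the function names uses_s3 and split_s3_uri are hypothetical and only the standard library is used.

```python
from urllib.parse import urlparse


def uses_s3(path: str, config: dict) -> bool:
    """Return True when a path should be handled by an S3 filesystem.

    Hypothetical helper: an explicit s3:// URI wins, otherwise the
    storage.type setting decides.
    """
    if urlparse(path).scheme == "s3":
        return True
    return config.get("storage", {}).get("type") == "s3"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")
```

With the optional dependency installed, a path that qualifies could then be handed to an s3fs.S3FileSystem instance, for example through its ls and open methods.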

PyWavelets (PyWT)

Role: Wavelet transforms for feature extraction.
Usage: In the features pipeline, wavelet coefficients (e.g. db2, configurable level) are computed per window and combined into a fixed-length feature vector per row.
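
A minimal sketch of per-window wavelet features follows. It assumes a window length of 64 samples, a decomposition level of 2, and plain concatenation of the coefficient arrays; the features pipeline may combine coefficients differently. The function name wavelet_features is hypothetical.

```python
import numpy as np
import pywt


def wavelet_features(window: np.ndarray, wavelet: str = "db2", level: int = 2) -> np.ndarray:
    """Decompose one window and concatenate all coefficient arrays."""
    # wavedec returns [approximation, detail_level_n, ..., detail_level_1].
    coeffs = pywt.wavedec(window, wavelet, level=level)
    return np.concatenate(coeffs)


# Synthetic data standing in for four signal windows of equal length.
rng = np.random.default_rng(0)
windows = rng.normal(size=(4, 64))

# Equal window lengths give equal-length feature vectors, one per row.
features = np.vstack([wavelet_features(w) for w in windows])
```

Because the coefficient lengths depend only on the window length, wavelet, and level, every window maps to a vector of the same size.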

NumPy / SciPy

Role: Numerical operations and signal/statistical helpers.
Usage: Array ops in transforms, discretization (e.g. leak size clusters), and metric computation.
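
To make the discretization step concrete, the sketch below turns a continuous leak size into three classes. It uses quantile binning with np.quantile and np.digitize as a stand-in; the platform's own clustering method is not specified here, and the leak sizes are made-up values.

```python
import numpy as np

# Made-up leak sizes for illustration only.
leak_sizes = np.array([0.5, 1.2, 3.4, 0.8, 7.9, 2.1, 5.5, 0.3])

# Interior bin edges at the 1/3 and 2/3 quantiles give three classes.
edges = np.quantile(leak_sizes, [1 / 3, 2 / 3])

# digitize maps each value to the index of the bin it falls into: 0, 1, or 2.
size_class = np.digitize(leak_sizes, edges)
```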

Machine Learning

LightGBM

Role: Gradient boosting for classification and regression.
Usage: All training pipelines (PFM detection/size/location/leak flow, OBSERVER detection/size/location), feature selection (importance-based top-K), and offline evaluation (inference). Training uses configurable lgbm_params, early stopping, and validation splits (by row or by case).

scikit-learn

Role: Splits, metrics, and preprocessing.
Usage: train_test_split, GroupShuffleSplit (case-based splits), accuracy/precision/recall/F1/ROC-AUC, scaling (optional), and clustering for multiclass labels.
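
A case-based split keeps all rows of one case on the same side of the split, which avoids leakage between training and validation. The sketch below uses GroupShuffleSplit with a hypothetical case_ids array; the data and split ratio are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Twelve rows belonging to six cases, two rows per case.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 0, 1, 1] * 3)
case_ids = np.repeat(np.arange(6), 2)

# Split by case rather than by row.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=case_ids))

train_cases = set(case_ids[train_idx].tolist())
val_cases = set(case_ids[val_idx].tolist())
```

By contrast, train_test_split shuffles individual rows, so rows from one case can land in both sets.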

Optuna

Role: Hyperparameter optimization.
Usage: run_optimize_lgbm_hyperparameters.py runs Optuna studies for detection, multiclass, or regression objectives. It supports pruning, a timeout, and an optional feature-selection step before optimization.

joblib

Role: Serialization of models and large objects.
Usage: Saving/loading LightGBM models and artifacts in training and deployment bundles.
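
A minimal save and load round trip with joblib looks like the sketch below. The artifact is a plain dictionary with hypothetical keys; in practice it could equally hold a trained model object. The file name bundle.joblib is also an assumption.

```python
import tempfile
from pathlib import Path

import joblib

# Hypothetical artifact; a fitted model could be stored the same way.
artifact = {
    "feature_names": ["p_mean", "p_std", "wavelet_0"],
    "threshold": 0.5,
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bundle.joblib"
    # compress trades a little CPU time for a smaller file.
    joblib.dump(artifact, path, compress=3)
    restored = joblib.load(path)
```

As with pickle, joblib.load should only be used on files from a trusted source.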

Orchestration & Scheduling

APScheduler

Role: In-process job scheduling.
Usage: etl_scheduler.py uses AsyncIOScheduler with interval, cron, or daily triggers to run the TPL/GENKEY pipeline (or other configured pipeline) on a schedule. Logging, stats, and graceful shutdown are built on top.

Prefect 3

Role: Flow orchestration and production deployment.
Usage: Flows wrap pipeline execution (e.g. genkey_pipeline_flow); deployments and schedules are defined in prefect.yaml and managed via deploy_flows.py / prefect_manage.sh. Progress and logs can be sent to Prefect via PrefectProgressMonitor.

PyYAML

Role: Configuration loading.
Usage: All YAML configs (pipelines_config.yml, ETL scheduler configs, Prefect-related config) are loaded with yaml.safe_load.
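
Loading a config with yaml.safe_load can be sketched as follows. The storage.type key appears elsewhere on this page; the features block and its keys are hypothetical, and the YAML is inlined as a string here rather than read from pipelines_config.yml.

```python
import yaml

# Inline YAML standing in for a config file; the features keys are hypothetical.
CONFIG_TEXT = """
storage:
  type: s3
features:
  wavelet: db2
  level: 2
"""

# safe_load builds only plain Python types and never runs arbitrary constructors.
config = yaml.safe_load(CONFIG_TEXT)

# Fall back to a default when a key is missing.
storage_type = config.get("storage", {}).get("type", "local")
```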

Reporting & Export

openpyxl

Role: Excel file generation.
Usage: Offline test pipelines write Excel reports (e.g. offline_report.xlsx) with multiple sheets and optional styling.
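
A multi-sheet workbook with a styled header row can be produced as in the sketch below. The sheet names, columns, and values are placeholders rather than the layout of offline_report.xlsx, and the workbook is saved to an in-memory buffer instead of a file.

```python
import io

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

workbook = Workbook()

# First sheet: summary metrics (placeholder values).
summary = workbook.active
summary.title = "Summary"
summary.append(["metric", "value"])
summary.append(["accuracy", 0.93])
for cell in summary[1]:
    cell.font = Font(bold=True)  # Bold header row.

# Second sheet: per-case results (placeholder values).
per_case = workbook.create_sheet("PerCase")
per_case.append(["case_id", "detected"])
per_case.append(["case_001", True])

# Save to a buffer and reload to confirm the structure.
buffer = io.BytesIO()
workbook.save(buffer)
buffer.seek(0)
sheet_names = load_workbook(buffer).sheetnames
```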

Matplotlib

Role: Plotting.
Usage: Training curves, confusion matrices, diagnostic plots, and leak series in offline evaluation.
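
As one example of these plots, the sketch below renders a confusion matrix to a PNG using the non-interactive Agg backend, which suits pipelines that run without a display. The counts and class labels are placeholders.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Headless backend: no display required.
import matplotlib.pyplot as plt
import numpy as np

# Placeholder counts: rows are actual classes, columns are predictions.
confusion = np.array([[50, 3], [4, 43]])
class_names = ["no leak", "leak"]

fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(confusion, cmap="Blues")
ax.set_xticks([0, 1])
ax.set_xticklabels(class_names)
ax.set_yticks([0, 1])
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")

# Write each count into its cell.
for (row, col), count in np.ndenumerate(confusion):
    ax.text(col, row, str(count), ha="center", va="center")

fig.colorbar(image, ax=ax)

buffer = io.BytesIO()
fig.savefig(buffer, format="png", bbox_inches="tight")
plt.close(fig)  # Free the figure's memory in long-running jobs.
png_bytes = buffer.getvalue()
```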

Concurrency & System

concurrent.futures

Role: Parallel execution.
Usage: ProcessPoolExecutor in the features pipeline (parallel file processing); ThreadPoolExecutor in loaders for I/O.
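
Both executors share the same interface, so the choice between processes (CPU-bound file processing) and threads (I/O-bound loading) can be a parameter. The sketch below assumes a hypothetical per-file function, count_characters, as a stand-in for real work.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_characters(name: str) -> int:
    """Hypothetical stand-in for per-file work."""
    return len(name)


def process_all(names, executor_cls=ThreadPoolExecutor, max_workers=2):
    """Apply count_characters to every name, preserving input order."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(count_characters, names))


if __name__ == "__main__":
    # Process pools need this guard on platforms that spawn new interpreters.
    files = ["a.parquet", "bb.parquet", "ccc.parquet"]
    print(process_all(files, executor_cls=ProcessPoolExecutor))
```

Functions passed to a process pool must be defined at module level so they can be pickled.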

psutil

Role: System and process information.
Usage: Optionally used for resource monitoring and for sizing worker pools.
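
One possible worker-sizing heuristic is sketched below: cap the worker count by physical cores, by available memory, and by a fixed maximum. The function name, the 2 GB per-worker budget, and the cap of 8 are assumptions, not platform defaults.

```python
import psutil


def suggest_worker_count(memory_per_worker_gb: float = 2.0, max_workers: int = 8) -> int:
    """Pick a worker count bounded by cores, free memory, and a hard cap."""
    # cpu_count can return None when the value cannot be determined.
    physical_cores = psutil.cpu_count(logical=False) or 1
    available_gb = psutil.virtual_memory().available / 1024**3
    by_memory = int(available_gb // memory_per_worker_gb)
    # Always return at least one worker.
    return max(1, min(physical_cores, by_memory, max_workers))


workers = suggest_worker_count()
```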

tzlocal / pytz

Role: Timezone handling.
Usage: Scheduler and Prefect schedules use configurable timezones for cron and daily runs.
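
The effect of a configurable timezone on a daily run can be illustrated with pytz: the same local wall-clock time maps to different UTC times in winter and summer. The helper name daily_run_in_utc and the Europe/Berlin zone are assumptions for the example.

```python
from datetime import datetime

import pytz


def daily_run_in_utc(run_date: datetime, hour: int, minute: int, tz_name: str) -> datetime:
    """Convert a local daily run time on a given date to UTC."""
    local_tz = pytz.timezone(tz_name)
    naive_local = run_date.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # localize attaches the correct UTC offset for that date, including DST.
    local_time = local_tz.localize(naive_local)
    return local_time.astimezone(pytz.utc)


# 02:30 local time in winter (UTC+1) and in summer (UTC+2).
winter = daily_run_in_utc(datetime(2024, 1, 15), 2, 30, "Europe/Berlin")
summer = daily_run_in_utc(datetime(2024, 7, 15), 2, 30, "Europe/Berlin")
```

With pytz, localize should be used instead of passing the zone as tzinfo to the datetime constructor, which can attach an incorrect offset.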

Version Bounds

Exact version bounds are defined in pyproject.toml under [project.optional-dependencies] (e.g. etl, ml, s3, docs). Install with:

pip install -e ".[etl,ml,s3,docs]"