Frameworks & Libraries
This page details the main frameworks and libraries used across the platform and how they are used.
ETL & Data
pandas
- Role: Primary data structure for tabular data in ETL, features, and training.
- Usage: Reading/writing Parquet and CSV, windowing, joins, and feature alignment. All pipeline scripts rely on DataFrames for in-memory processing.
PyArrow
- Role: Fast Parquet I/O and columnar representation.
- Usage: pandas.read_parquet(..., engine="pyarrow") and df.to_parquet(..., engine="pyarrow"). Row groups and compression are configurable in the features and parquet–CSV pipelines.
s3fs
- Role: S3-backed filesystem compatible with fsspec.
- Usage: Used by mlops.storage to resolve s3:// paths, list objects, and read/write files when the config uses S3 (explicit URIs or storage.type: s3). Optional dependency: pip install .[s3].
PyWavelets (PyWT)
- Role: Wavelet transforms for feature extraction.
- Usage: In the features pipeline, wavelet coefficients (e.g. db2, configurable level) are computed per window and combined into a fixed-length feature vector per row.
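One way the per-window wavelet features could look is sketched below. The function name wavelet_features, the window length, and the choice of band energy and standard deviation as summary statistics are assumptions for illustration; the features pipeline may combine coefficients differently.

```python
import numpy as np
import pywt


def wavelet_features(window: np.ndarray, wavelet: str = "db2", level: int = 3) -> np.ndarray:
    """Return one fixed-length vector: energy and std of each coefficient band."""
    # wavedec returns [cA_level, cD_level, ..., cD_1], i.e. level + 1 arrays.
    coeffs = pywt.wavedec(window, wavelet, level=level)
    energy = [float(np.sum(c ** 2)) for c in coeffs]
    spread = [float(np.std(c)) for c in coeffs]
    return np.array(energy + spread)


rng = np.random.default_rng(0)
signal = rng.normal(size=256)       # synthetic signal, not real sensor data
windows = signal.reshape(4, 64)     # four non-overlapping windows of 64 samples

# One row of features per window; the row length depends only on `level`.
features = np.vstack([wavelet_features(w) for w in windows])
```

Because each band is reduced to two scalars, the vector length is 2 * (level + 1) regardless of the window length.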
NumPy / SciPy
- Role: Numerical operations and signal/statistical helpers.
- Usage: Array ops in transforms, discretization (e.g. leak size clusters), and metric computation.
Machine Learning
LightGBM
- Role: Gradient boosting for classification and regression.
- Usage: All training pipelines (PFM detection/size/location/leak flow, OBSERVER detection/size/location), feature selection (importance-based top-K), and offline evaluation (inference). Training uses configurable lgbm_params, early stopping, and validation splits (by row or by case).
scikit-learn
- Role: Splits, metrics, and preprocessing.
- Usage: train_test_split, GroupShuffleSplit (case-based splits), accuracy/precision/recall/F1/ROC-AUC, scaling (optional), and clustering for multiclass labels.
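A case-based split keeps all rows of one case on the same side of the split, which avoids leakage between training and validation. The sketch below uses made-up case IDs and a toy feature matrix to show the idea with GroupShuffleSplit.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 rows belonging to 4 cases, 3 rows per case.
case_ids = np.repeat(np.arange(4), 3)
X = np.arange(24).reshape(12, 2)
y = np.tile([0, 1, 1], 4)

# test_size is a fraction of the groups (cases), not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=case_ids))

train_cases = set(case_ids[train_idx])
val_cases = set(case_ids[val_idx])
```

By contrast, a plain train_test_split on rows can place rows from the same case in both sets.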
Optuna
- Role: Hyperparameter optimization.
- Usage: run_optimize_lgbm_hyperparameters.py runs Optuna studies for detection, multiclass, or regression objectives; it supports pruning, a timeout, and optional feature selection beforehand.
joblib
- Role: Serialization of models and large objects.
- Usage: Saving/loading LightGBM models and artifacts in training and deployment bundles.
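A minimal sketch of saving and loading an artifact bundle with joblib. The bundle layout (keys model, feature_names, threshold) and the file name are hypothetical, and a plain dict stands in for a fitted LightGBM model.

```python
import tempfile
from pathlib import Path

import joblib

# Hypothetical bundle; a real one would hold a fitted model object.
bundle = {
    "model": {"kind": "placeholder"},
    "feature_names": ["f0", "f1"],
    "threshold": 0.5,
}

path = Path(tempfile.mkdtemp()) / "bundle.joblib"

# compress trades a little CPU time for a smaller file on disk.
joblib.dump(bundle, path, compress=3)
loaded = joblib.load(path)
```

Storing the feature names next to the model makes it easier to check at inference time that the input columns match what the model was trained on.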
Orchestration & Scheduling
APScheduler
- Role: In-process job scheduling.
- Usage: etl_scheduler.py uses AsyncIOScheduler with interval, cron, or daily triggers to run the TPL/GENKEY pipeline (or another configured pipeline) on a schedule. Logging, stats, and graceful shutdown are built on top.
Prefect 3
- Role: Flow orchestration and production deployment.
- Usage: Flows wrap pipeline execution (e.g. genkey_pipeline_flow); deployments and schedules are defined in prefect.yaml and managed via deploy_flows.py / prefect_manage.sh. Progress and logs can be sent to Prefect via PrefectProgressMonitor.
PyYAML
- Role: Configuration loading.
- Usage: All YAML configs (pipelines_config.yml, ETL scheduler configs, Prefect-related config) are loaded with yaml.safe_load.
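For reference, loading a config with yaml.safe_load looks like the sketch below. The YAML content is a made-up fragment; the real key layout of pipelines_config.yml is not shown in this page.

```python
import yaml

# Hypothetical config text; the real keys live in pipelines_config.yml.
config_text = """
storage:
  type: s3
features:
  wavelet: db2
  level: 3
"""

# safe_load builds only plain Python types (dict, list, str, int, ...),
# so untrusted YAML cannot construct arbitrary objects.
config = yaml.safe_load(config_text)
wavelet = config["features"]["wavelet"]
```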
Reporting & Export
openpyxl
- Role: Excel file generation.
- Usage: Offline test pipelines write Excel reports (e.g. offline_report.xlsx) with multiple sheets and optional styling.
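A multi-sheet report with light styling might be produced along these lines. The sheet names, columns, and values are placeholders for illustration, not the layout or results of the actual offline report.

```python
import tempfile
from pathlib import Path

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

wb = Workbook()

# The first sheet exists by default; rename it and add a header row.
summary = wb.active
summary.title = "Summary"
summary.append(["metric", "value"])
summary.append(["accuracy", 0.93])  # illustrative number, not a real result
for cell in summary[1]:
    cell.font = Font(bold=True)     # optional styling: bold header

# A second sheet with per-case rows (hypothetical columns).
per_case = wb.create_sheet("PerCase")
per_case.append(["case_id", "predicted", "actual"])
per_case.append([1, "leak", "leak"])

path = Path(tempfile.mkdtemp()) / "offline_report.xlsx"
wb.save(path)

sheet_names = load_workbook(path).sheetnames
```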
Matplotlib
- Role: Plotting.
- Usage: Training curves, confusion matrices, diagnostic plots, and leak series in offline evaluation.
Concurrency & System
concurrent.futures
- Role: Parallel execution.
- Usage: ProcessPoolExecutor in the features pipeline (parallel file processing); ThreadPoolExecutor in loaders for I/O.
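The split between process pools for CPU-bound file processing and thread pools for I/O can be sketched with one helper that switches executor type. process_file and run_parallel are hypothetical names, and the per-file work is a trivial stand-in.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def process_file(name: str) -> tuple[str, int]:
    """Stand-in for per-file feature extraction (hypothetical work)."""
    return name, len(name)


def run_parallel(names: list[str], use_processes: bool, max_workers: int = 2) -> dict[str, int]:
    """Run process_file over all names with a process or thread pool."""
    pool_cls: type[Executor] = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=max_workers) as pool:
        # map yields results in input order, whichever worker finishes first.
        return dict(pool.map(process_file, names))
```

With use_processes=True the worker function must be importable at module level, and the call should sit behind an `if __name__ == "__main__":` guard on platforms that spawn new interpreters.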
psutil
- Role: System and process information.
- Usage: Optional use for resource monitoring and worker sizing.
tzlocal / pytz
- Role: Timezone handling.
- Usage: Scheduler and Prefect schedules use configurable timezones for cron and daily runs.
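To make the effect of a configured timezone concrete, the sketch below converts a daily local run time to UTC with pytz. The timezone name and run time are example values, not the platform's configuration.

```python
from datetime import datetime

import pytz

tz = pytz.timezone("Europe/Berlin")  # hypothetical configured timezone

# With pytz, attach the zone via localize() rather than tzinfo=...,
# so the correct UTC offset for that date is applied.
local_run = tz.localize(datetime(2024, 1, 15, 2, 30))  # daily run at 02:30 local time

utc_run = local_run.astimezone(pytz.utc)
```

In January Berlin is one hour ahead of UTC, so the run lands at 01:30 UTC; in summer the same local time maps to 00:30 UTC.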
Version Bounds
Exact version bounds are defined in pyproject.toml under [project.optional-dependencies] (e.g. etl, ml, s3, docs). Install with: