Leak Detection MLOps Platform
From Raw Data to Production Models
End-to-end pipelines for leak detection, feature engineering, and ML deployment
Welcome to the Leak Detection MLOps Platform
The Leak Detection MLOps Platform is a production-grade Python framework for building, running, and scheduling ETL pipelines, feature extraction, model training, and offline evaluation, with first-class support for Prefect and Amazon S3.
Our Commitment to Excellence
We are committed to delivering:
- World-class documentation: Clear, structured, and reader-friendly so you get value fast
- Reproducible pipelines: Idempotent runs, config hashes, and checkpointing so re-runs are safe and predictable
- Flexible storage: Local paths or S3; same configuration patterns across pipelines
- Production scheduling: APScheduler for simple cron/interval runs, Prefect 3 for observability and deployment
- Clean architecture: Domain-driven design, testable components, and consistent conventions
What Is This Platform?
The platform bridges raw simulation or sensor data and deployable ML models through a set of composable pipelines:
- ETL: Ingest OLGA `.tpl`/`.genkey` (or similar) data into Parquet.
- Windowing: Slice time series into fixed-size windows for modeling.
- Feature extraction: Wavelet and statistical features per window.
- Feature selection: LightGBM-based importance and top-K selection.
- Training: Binary (detection), multiclass (size/location), and regression (e.g. leak flow) with LightGBM.
- Hyperparameter optimization: Optuna-driven tuning with configurable metrics.
- Offline evaluation: Test detection and diagnostics on held-out data with Excel and plots.
All pipelines are config-driven via a central pipelines_config.yml and support local and S3 storage.
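To make the windowing and feature-extraction steps listed above concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the platform's implementation: `fixed_windows` and `window_stats` are hypothetical names, the window size and sample values are invented, and only simple statistics are computed (wavelet features are left out).

```python
from statistics import mean, pstdev
from typing import Dict, Iterator, Optional, Sequence


def fixed_windows(
    values: Sequence[float], size: int, step: Optional[int] = None
) -> Iterator[Sequence[float]]:
    """Yield fixed-size windows; a trailing partial window is dropped."""
    if size <= 0:
        raise ValueError("size must be positive")
    step = size if step is None else step  # default: non-overlapping windows
    if step <= 0:
        raise ValueError("step must be positive")
    for start in range(0, len(values) - size + 1, step):
        yield values[start:start + size]


def window_stats(window: Sequence[float]) -> Dict[str, float]:
    """Simple per-window statistics (hypothetical feature set)."""
    return {
        "mean": mean(window),
        "std": pstdev(window),  # population standard deviation
        "min": min(window),
        "max": max(window),
    }


# Invented pressure signal: steady, then dropping.
pressure = [10.0, 10.0, 10.0, 10.0, 9.0, 8.0, 7.0, 6.0, 6.0]
features = [window_stats(w) for w in fixed_windows(pressure, size=4)]
```

With nine samples and a window size of four, this yields two windows; the ninth sample does not fill a window and is dropped. Passing a `step` smaller than `size` would produce overlapping windows instead.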
Key Capabilities
ETL & Data Prep
- TPL/GENKEY pipeline: Convert OLGA outputs to Parquet with selected columns, optional instrument noise, and metadata.
- Windows pipeline: Build fixed-size windows from Parquet time series; skip already-processed files (idempotent).
- Features pipeline: Wavelet-based feature extraction with configurable columns and checkpointing.
- Parquet → CSV: Export with optional leak filtering and config-hash idempotency.
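One common way to get the "skip already-processed files" and config-hash behavior described above is to hash a canonical form of the config and record a marker per input file. The sketch below assumes that approach; `config_hash`, `run_if_needed`, the `.done` marker naming, and the config keys are all hypothetical and not taken from the platform. Note that it keys on the file name, not the file contents.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Stable SHA-256 of a config dict; keys are sorted so ordering is irrelevant."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_if_needed(
    source: Path,
    out_dir: Path,
    config: Dict[str, Any],
    process: Callable[[Path, Path], None],
) -> bool:
    """Run `process` unless a marker shows this file was done with this config."""
    out_dir.mkdir(parents=True, exist_ok=True)
    marker = out_dir / f"{source.stem}.{config_hash(config)[:12]}.done"
    if marker.exists():
        return False  # same input name and same config: skip
    process(source, out_dir)
    marker.write_text(source.name, encoding="utf-8")  # written only after success
    return True


# Demonstration with a temporary directory and a stand-in processing step.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "case_001.parquet"
    src.write_bytes(b"")
    out = Path(tmp) / "out"
    calls = []
    cfg = {"columns": ["pressure", "flow"], "window": 256}  # invented keys
    first = run_if_needed(src, out, cfg, lambda s, o: calls.append(s.name))
    second = run_if_needed(src, out, cfg, lambda s, o: calls.append(s.name))
    third = run_if_needed(src, out, {**cfg, "window": 512},
                          lambda s, o: calls.append(s.name))
```

The first call runs, the second is skipped, and the third runs again because the changed `window` value produces a different hash. Writing the marker only after `process` returns means a failed run is retried on the next invocation.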
ML & Evaluation
- Feature selection: Top-K by LightGBM importance; output schema for training/Optuna.
- Training pipelines: PFM (detection, size, location, leak flow) and OBSERVER (detection, size, location) with validation metrics and deployment bundles.
- Hyperparameter optimization: Optuna with detection/multiclass/regression support and optional pre-feature selection.
- Test offline: Run models on parquet/csv data; Excel reports, confusion matrices, and diagnostic plots.
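The top-K selection step mentioned above reduces to ranking features by an importance score and keeping the first K. The sketch below shows only that ranking step in plain Python; `top_k_features` and the feature names and scores are invented for illustration. With LightGBM, scores like these could be obtained from `Booster.feature_importance(importance_type="gain")` paired with `Booster.feature_name()`, though how the platform computes them is not specified here.

```python
from typing import Dict, List


def top_k_features(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance.

    Ties are broken alphabetically so the result is deterministic.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Invented importance scores for four hypothetical features.
importances = {
    "pressure_db4_energy_l3": 412.0,
    "flow_mean": 95.5,
    "pressure_std": 230.1,
    "temperature_max": 3.2,
}
selected = top_k_features(importances, k=2)
```

A deterministic tie-break matters for reproducibility: the same importances always produce the same selected schema, which keeps downstream training and tuning inputs stable across re-runs.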
Operations
- Idempotency: Same config + same inputs → no re-run (or skip already-processed files).
- S3 support: Use `s3://` paths or a `storage` section (bucket + prefix) in config.
- Scheduling: APScheduler (interval, cron, daily) or Prefect 3 flows and deployments for production.
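The two ways of pointing at S3 described above (a full `s3://` path, or a relative path combined with a bucket and prefix) can be reconciled with a small resolver. This is a sketch under assumptions: `split_s3_uri` and `resolve_location` are hypothetical helpers, and the `bucket` and `prefix` key names are guesses based on the description, not the platform's actual schema.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


def resolve_location(path: str, storage: Optional[Dict[str, str]] = None) -> str:
    """Return a full s3:// URI, or the path unchanged for local storage."""
    if path.startswith("s3://"):
        return path  # explicit URI wins over any storage section
    if storage:
        prefix = storage.get("prefix", "").strip("/")
        key = "/".join(p for p in (prefix, path.lstrip("/")) if p)
        return f"s3://{storage['bucket']}/{key}"
    return path  # no storage section: treat as a local path


# Invented bucket, prefix, and file names.
storage = {"bucket": "leak-data", "prefix": "prod/"}
resolved = resolve_location("windows/case_001.parquet", storage)
bucket, key = split_s3_uri(resolved)
```

Resolving every location through one function is what lets the rest of a pipeline stay unaware of whether it is reading from a local directory or from S3.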
Production-Ready from Day One
Config-Driven. Idempotent. S3-Ready.
The MLOps Platform delivers the same rigor you expect from enterprise ML pipelines (reproducible runs, by-case validation splits, and optional Prefect deployment) with a single config file and no vendor lock-in.
Start with local paths, move to S3 when ready. Run once or on a schedule. Re-run safely; only what changed gets recomputed.
Who Is This For?
Data Engineers
Run ETL and features at scale; local or S3, with idempotency and checkpointing.
ML Engineers
Train and tune detection/size/location models with by-case validation and deployment bundles.
DevOps / MLOps
Schedule pipelines with APScheduler or Prefect; same config, same conventions.
Research
Experiment with features and models; Engineering docs explain the philosophy and trade-offs.
Getting Started
- Quick Start: install, configure, and run your first pipeline.
- Architecture: layers, components, and design principles.
- Pipelines: overview of each pipeline and its configuration in detail.
- Prefect & Production: deploy scheduled runs with Prefect.
- Engineering: operational context, validation, overfitting, and ETL design.
Documentation Map
| Section | Description |
|---|---|
| Tech Stack | Languages, runtimes, storage, and infrastructure. |
| Frameworks & Libraries | Python packages (pandas, LightGBM, Prefect, etc.) and how they are used. |
| Development Conventions | Code style, config patterns, and project layout. |
| Objectives & Use Cases | Goals and typical workflows (R&D, production, evaluation). |
| Pipelines | Detailed pipeline descriptions and configuration reference. |
| Prefect & Production | ETL scheduler, Prefect flows, and production deployment. |
| Engineering | Philosophy, operational/temporal context, validation, overfitting, ETL design, and deep dives. |
Ready to Run Pipelines?
Start with the Quick Start, then explore each pipeline and its config.
All pipelines are idempotent and support local and S3 storage.
Where data becomes models, and models go to production.
Welcome to the MLOps Platform: world-class pipelines, one config file.