Leak Detection MLOps Platform

From Raw Data to Production Models

End-to-end pipelines for leak detection, feature engineering, and ML deployment

🚀 Welcome to the Leak Detection MLOps Platform

The Leak Detection MLOps Platform is a production-grade Python framework for building, running, and scheduling ETL pipelines, feature extraction, model training, and offline evaluation—with first-class support for Prefect and Amazon S3.

Our Commitment to Excellence

We are committed to delivering:

World-class documentation — Clear, structured, and reader-friendly so you get value fast
Reproducible pipelines — Idempotent runs, config hashes, and checkpointing so re-runs are safe and predictable
Flexible storage — Local paths or S3; same configuration patterns across pipelines
Production scheduling — APScheduler for simple cron/interval runs, Prefect 3 for observability and deployment
Clean architecture — Domain-driven design, testable components, and consistent conventions

What Is This Platform?

The platform bridges raw simulation or sensor data and deployable ML models through a set of composable pipelines:

ETL — Ingest OLGA .tpl/.genkey (or similar) data into Parquet.
Windowing — Slice time series into fixed-size windows for modeling.
Feature extraction — Wavelet and statistical features per window.
Feature selection — LightGBM-based importance and top-K selection.
Training — Binary (detection), multiclass (size/location), and regression (e.g. leak flow) with LightGBM.
Hyperparameter optimization — Optuna-driven tuning with configurable metrics.
Offline evaluation — Test detection and diagnostics on held-out data with Excel and plots.

All pipelines are config-driven via a central pipelines_config.yml and support local and S3 storage.
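As a rough illustration of the windowing and per-window statistics steps listed above, the following sketch slices a time series into fixed-size windows and computes a few summary features for each. The function names `make_windows` and `window_stats` are hypothetical and are not taken from the platform; the wavelet features are omitted here to keep the example dependency-free.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Optional, Sequence


def make_windows(
    series: Sequence[float], size: int, step: Optional[int] = None
) -> List[List[float]]:
    """Slice a series into fixed-size windows (hypothetical helper).

    With step=None the windows do not overlap. A trailing remainder
    shorter than `size` is dropped.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    step = size if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    return [
        list(series[i : i + size])
        for i in range(0, len(series) - size + 1, step)
    ]


def window_stats(window: Sequence[float]) -> Dict[str, float]:
    """Compute simple statistical features for one window."""
    return {
        "mean": fmean(window),
        "std": pstdev(window),  # population standard deviation
        "min": min(window),
        "max": max(window),
    }


# Example: one feature row per window of a short pressure trace.
pressure = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
features = [window_stats(w) for w in make_windows(pressure, size=3)]
```

In this sketch the seventh sample is discarded because it does not fill a complete window; whether the platform pads, drops, or overlaps windows is determined by its own configuration.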

🔑 Key Capabilities

📂 ETL & Data Prep

TPL/GENKEY pipeline — Convert OLGA outputs to Parquet with selected columns, optional instrument noise, and metadata.
Windows pipeline — Build fixed-size windows from Parquet time series; skip already-processed files (idempotent).
Features pipeline — Wavelet-based feature extraction with configurable columns and checkpointing.
Parquet → CSV — Export with optional leak filtering and config-hash idempotency.
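One common way to get config-hash idempotency is to hash a canonical form of the configuration and store it next to the output, then skip the run when the stored hash matches. The sketch below assumes that approach; `config_hash`, `should_run`, `mark_done`, and the marker file are hypothetical names, and the platform's actual mechanism may differ.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config: dict) -> str:
    """Return a stable SHA-256 hash of a JSON-serializable config."""
    # sort_keys makes the hash independent of key order in the dict.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_run(config: dict, marker: Path) -> bool:
    """True unless the marker file already holds this config's hash."""
    if not marker.exists():
        return True
    return marker.read_text().strip() != config_hash(config)


def mark_done(config: dict, marker: Path) -> None:
    """Record the config hash after a successful run."""
    marker.write_text(config_hash(config))
```

A pipeline step would call `should_run` before doing any work and `mark_done` only after the output has been written, so an interrupted run is retried on the next invocation.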

🤖 ML & Evaluation

Feature selection — Top-K by LightGBM importance; output schema for training/Optuna.
Training pipelines — PFM (detection, size, location, leak flow) and OBSERVER (detection, size, location) with validation metrics and deployment bundles.
Hyperparameter optimization — Optuna with detection/multiclass/regression support and optional pre–feature selection.
Test offline — Run models on parquet/csv data; Excel reports, confusion matrices, and diagnostic plots.
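Top-K feature selection reduces to ranking features by an importance score and keeping the first K. The sketch below assumes the importances have already been obtained from a trained model and are available as a plain dictionary; `select_top_k` is a hypothetical helper, not the platform's API.

```python
from typing import Dict, List


def select_top_k(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance.

    Ties are broken alphabetically so the result is deterministic.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Example with made-up importance values.
selected = select_top_k(
    {"p_mean": 12.0, "p_std": 30.0, "t_mean": 30.0, "t_max": 0.0}, k=2
)
```

Deterministic tie-breaking matters for reproducibility: without it, two runs on the same data could emit different feature lists when importances are equal.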

⚙️ Operations

Idempotency — Same config + same inputs ⇒ no re-run (or skip already-processed files).
S3 support — Use s3:// paths or a storage section (bucket + prefix) in config.
Scheduling — APScheduler (interval, cron, daily) or Prefect 3 flows and deployments for production.
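To illustrate how the two S3 addressing styles could resolve to the same kind of location, the sketch below accepts either an `s3://` path or a relative path combined with a storage section. The key names `bucket` and `prefix` follow the description above, but the exact config schema, the `Location` type, and `resolve_location` are assumptions made for this example.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class Location(NamedTuple):
    scheme: str            # "s3" or "file"
    bucket: Optional[str]  # None for local paths
    path: str              # S3 key or local filesystem path


def resolve_location(path: str, storage: Optional[dict] = None) -> Location:
    """Resolve a configured path to a local or S3 location."""
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # s3://bucket/key -> the bucket is the netloc, the key is the path.
        return Location("s3", parsed.netloc, parsed.path.lstrip("/"))
    if storage:
        # Relative path placed under the bucket and prefix from the config.
        prefix = storage.get("prefix", "").strip("/")
        key = path.lstrip("/")
        return Location("s3", storage["bucket"], f"{prefix}/{key}" if prefix else key)
    return Location("file", None, path)
```

Resolving paths in one place lets the rest of a pipeline stay unaware of where the data lives, which is what allows the same configuration pattern to serve both local and S3 runs.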

💎 Production-Ready from Day One

Config-Driven. Idempotent. S3-Ready.

The MLOps Platform delivers the same rigor you expect from enterprise ML pipelines—reproducible runs, by-case validation splits, and optional Prefect deployment—with a single config file and no vendor lock-in.

Start with local paths, move to S3 when ready. Run once or on a schedule. Re-run safely; only what changed gets recomputed.
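A by-case validation split is read here as keeping every window from the same simulation case on the same side of the split, so that the validation set contains only unseen cases. The sketch below implements that interpretation; `split_by_case` and its parameters are hypothetical, and the platform's own splitting logic may differ.

```python
import random
from typing import List, Sequence, Tuple


def split_by_case(
    case_ids: Sequence[str], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split row indices so that no case appears in both sets.

    case_ids[i] is the case that row i belongs to. At least one case
    is always assigned to validation.
    """
    cases = sorted(set(case_ids))
    random.Random(seed).shuffle(cases)  # seeded for reproducible splits
    n_val = max(1, round(len(cases) * val_fraction))
    val_cases = set(cases[:n_val])
    train_idx = [i for i, c in enumerate(case_ids) if c not in val_cases]
    val_idx = [i for i, c in enumerate(case_ids) if c in val_cases]
    return train_idx, val_idx
```

Splitting at the case level rather than the row level avoids leakage between highly correlated windows cut from the same run, which would otherwise inflate validation metrics.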

Who Is This For?

📊 Data Engineers

Run ETL and features at scale; local or S3, with idempotency and checkpointing.

🤖 ML Engineers

Train and tune detection/size/location models with by-case validation and deployment bundles.

⚙️ DevOps / MLOps

Schedule pipelines with APScheduler or Prefect; same config, same conventions.

🔬 Research

Experiment with features and models; Engineering docs explain the philosophy and trade-offs.

📚 Getting Started

Quick Start: install, configure, and run your first pipeline.
Architecture: layers, components, and design principles.
Pipelines: an overview of each pipeline and its configuration in detail.
Prefect & Production: deploy scheduled runs with Prefect.
Engineering: operational context, validation, overfitting, and ETL design.

Documentation Map

Section	Description
Tech Stack	Languages, runtimes, storage, and infrastructure.
Frameworks & Libraries	Python packages (pandas, LightGBM, Prefect, etc.) and how they are used.
Development Conventions	Code style, config patterns, and project layout.
Objectives & Use Cases	Goals and typical workflows (R&D, production, evaluation).
Pipelines	Detailed pipeline descriptions and configuration reference.
Prefect & Production	ETL scheduler, Prefect flows, and production deployment.
Engineering	Philosophy, operational/temporal context, validation, overfitting, ETL design, and deep dives.

Ready to Run Pipelines?

Start with the Quick Start, then explore each pipeline and its config.

All pipelines are idempotent and support local and S3 storage.

Where data becomes models, and models go to production.

Welcome to the MLOps Platform — world-class pipelines, one config file.