Architecture¶

This page describes the high-level architecture of the MLOps Platform: design principles, layers, and main components.

Design Principles¶

Clean Architecture — Business logic is independent of frameworks and I/O; dependencies point inward (domain first).
Config-driven behavior — Pipelines are defined by YAML/JSON; the same code supports many environments via pipelines_config.yml and scheduler configs.
Idempotency — Runs are safe to repeat; config hashes and metadata prevent redundant work when config and inputs are unchanged.
Storage abstraction — Local and S3 are handled through mlops.storage (resolve paths, read/write Parquet/text, list files) so pipeline code stays storage-agnostic where possible.

System Overview¶

┌─────────────────────────────────────────────────────────────────┐
│                     Entry points (scripts/)                      │
│  run_*_pipeline.py  •  etl_scheduler.py  •  deploy_flows.py     │
└──────────────────────────────┬──────────────────────────────────┘
                                │
┌───────────────────────────────▼──────────────────────────────────┐
│                    Application / orchestration                    │
│  Config load  •  resolve_config_paths  •  Idempotency check      │
│  Prefect flows  •  APScheduler jobs                               │
└───────────────────────────────┬──────────────────────────────────┘
                                │
┌───────────────────────────────▼──────────────────────────────────┐
│                         Pipeline execution                       │
│  ETL: Extract → Transform → Load  (per pipeline type)             │
│  Training: lgbm_trainer  •  Optuna  •  Deploy bundle             │
│  Test offline: evaluate_all_cases  •  Reports                     │
└───────────────────────────────┬──────────────────────────────────┘
                                │
┌───────────────────────────────▼──────────────────────────────────┐
│                    Storage & I/O (mlops.storage)                  │
│  resolve_path  •  read_parquet / write_parquet  •  path_exists     │
│  Local filesystem  •  S3 (s3fs)                                  │
└───────────────────────────────────────────────────────────────────┘

Layers (as implemented)¶

1. Scripts / entry points¶

Location: scripts/.
Role: CLI, config loading, section selection, resolve_config_paths, idempotency check, and a single call into the pipeline or scheduler.
Examples: run_tpl_genkey_pipeline.py, run_features_pipeline.py, run_training_pfm_detection_pipeline.py, etl_scheduler.py, deploy_flows.py.

2. Pipeline execution¶

ETL: Each ETL pipeline type (TPL/GENKEY, Windows, Features, Parquet–CSV) has a factory that builds an Extractor, Transformer, and Loader. The generic ETLPipeline runs them in sequence; progress and Prefect logging are optional.
Training: mlops.training.lgbm_trainer (and regression/detection variants) loads data (local or S3), aligns to a feature schema, trains LightGBM, and writes model + metadata. Scripts add idempotency and deployment bundles.
Optimization: run_optimize_lgbm_hyperparameters.py runs Optuna studies using the same data and schema loading as training.
Test offline: Scripts load config, build an evaluation config, run evaluate_all_cases, then generate Excel and plots.

3. Storage abstraction¶

Location: mlops.storage (io.py, __init__.py).
Role: resolve_path(config, path) for local vs S3; read_parquet, write_parquet, path_exists, list_parquet_files, read_text, write_text; resolve_config_paths(config) to apply resolution to all known path keys. Enables pipelines to work with local or S3 without branching on URLs.

4. Domain / infrastructure (optional)¶

Location: src/ (if present) and mlops/ (etl, training, prefect).
Role: Entities and use-case interfaces may live under src; the bulk of the code is in mlops (ETL types, trainers, Prefect flows). ETL uses factories and interfaces (e.g. ETLExtractor, ETLLoader) so new pipeline types can be added without changing the runner.

ETL Pipeline Structure¶

Each ETL pipeline type follows the same pattern:

Extractor     →  list/read input (files or DB)
Transformer   →  transform batches (e.g. raw → windows, windows → features)
Loader        →  write output (Parquet, CSV, metadata, checkpoints)

TPL/GENKEY: Extract .tpl/.genkey pairs → parse and merge → write Parquet (+ optional metadata).
Windows: Extract Parquet time series → slice into fixed windows → write one Parquet per window (with “already processed” skip).
Features: Extract window Parquet → wavelet + stats → accumulate in buffer, write Parquet and checkpoints.
Parquet–CSV: Extract Parquet → optional filtering → write CSV (with overwrite/config-hash idempotency).

Progress monitoring (and Prefect logging) is pluggable via config (use_prefect_progress, prefect_logger).

Training and evaluation flow¶

Config — Script loads section from pipelines_config.yml (e.g. training_pfm_detection_pipeline).
Paths — resolve_config_paths resolves input_path, output_folder, features_schema_file (local or S3).
Idempotency — If training_metadata.json (or equivalent) exists and config_hash matches, exit without training.
Data — Load Parquet (single file or list from directory), align columns to features_schema_file, split (by row or by case).
Train — LightGBM training with validation and optional robustness (e.g. quantization); save model and metrics.
Post-processing — Detection pipelines may compute threshold metrics and write deployment bundles (inference config, schema).

Offline test pipelines follow a similar config → resolve → run evaluation → write reports pattern.

Scheduling and Prefect¶

APScheduler: etl_scheduler.py loads a YAML with scheduler and pipeline sections; it starts an AsyncIOScheduler and runs the configured pipeline (e.g. TPL/GENKEY) on interval, cron, or daily. One job at a time; graceful shutdown on SIGINT/SIGTERM.
Prefect: Flows (e.g. genkey_pipeline_flow) wrap the same pipeline execution; prefect.yaml and deployment scripts register schedules. Workers run flows; Prefect Server/UI provide observability. Progress can be reported via PrefectProgressMonitor.

Summary¶

The platform is config-driven, idempotent, and storage-aware (local + S3). Entry points are scripts and Prefect flows; core logic lives in mlops (ETL, storage, training). Documentation (this site), central config, and consistent conventions keep the system understandable and ready for production scheduling.