Architecture¶
This page describes the high-level architecture of the MLOps Platform: design principles, layers, and main components.
Design Principles¶
- Clean Architecture — Business logic is independent of frameworks and I/O; dependencies point inward (domain first).
- Config-driven behavior — Pipelines are defined by YAML/JSON; the same code supports many environments via
pipelines_config.ymland scheduler configs. - Idempotency — Runs are safe to repeat; config hashes and metadata prevent redundant work when config and inputs are unchanged.
- Storage abstraction — Local and S3 are handled through
mlops.storage(resolve paths, read/write Parquet/text, list files) so pipeline code stays storage-agnostic where possible.
System Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ Entry points (scripts/) │
│ run_*_pipeline.py • etl_scheduler.py • deploy_flows.py │
└──────────────────────────────┬──────────────────────────────────┘
│
┌───────────────────────────────▼──────────────────────────────────┐
│ Application / orchestration │
│ Config load • resolve_config_paths • Idempotency check │
│ Prefect flows • APScheduler jobs │
└───────────────────────────────┬──────────────────────────────────┘
│
┌───────────────────────────────▼──────────────────────────────────┐
│ Pipeline execution │
│ ETL: Extract → Transform → Load (per pipeline type) │
│ Training: lgbm_trainer • Optuna • Deploy bundle │
│ Test offline: evaluate_all_cases • Reports │
└───────────────────────────────┬──────────────────────────────────┘
│
┌───────────────────────────────▼──────────────────────────────────┐
│ Storage & I/O (mlops.storage) │
│ resolve_path • read_parquet / write_parquet • path_exists │
│ Local filesystem • S3 (s3fs) │
└───────────────────────────────────────────────────────────────────┘
Layers (as implemented)¶
1. Scripts / entry points¶
- Location:
scripts/. - Role: CLI, config loading, section selection,
resolve_config_paths, idempotency check, and a single call into the pipeline or scheduler. - Examples:
run_tpl_genkey_pipeline.py,run_features_pipeline.py,run_training_pfm_detection_pipeline.py,etl_scheduler.py,deploy_flows.py.
2. Pipeline execution¶
- ETL: Each ETL pipeline type (TPL/GENKEY, Windows, Features, Parquet–CSV) has a factory that builds an Extractor, Transformer, and Loader. The generic
ETLPipelineruns them in sequence; progress and Prefect logging are optional. - Training:
mlops.training.lgbm_trainer(and regression/detection variants) loads data (local or S3), aligns to a feature schema, trains LightGBM, and writes model + metadata. Scripts add idempotency and deployment bundles. - Optimization:
run_optimize_lgbm_hyperparameters.pyruns Optuna studies using the same data and schema loading as training. - Test offline: Scripts load config, build an evaluation config, run
evaluate_all_cases, then generate Excel and plots.
3. Storage abstraction¶
- Location:
mlops.storage(io.py,__init__.py). - Role:
resolve_path(config, path)for local vs S3;read_parquet,write_parquet,path_exists,list_parquet_files,read_text,write_text;resolve_config_paths(config)to apply resolution to all known path keys. Enables pipelines to work with local or S3 without branching on URLs.
4. Domain / infrastructure (optional)¶
- Location:
src/(if present) andmlops/(etl, training, prefect). - Role: Entities and use-case interfaces may live under
src; the bulk of the code is inmlops(ETL types, trainers, Prefect flows). ETL uses factories and interfaces (e.g.ETLExtractor,ETLLoader) so new pipeline types can be added without changing the runner.
ETL Pipeline Structure¶
Each ETL pipeline type follows the same pattern:
Extractor → list/read input (files or DB)
Transformer → transform batches (e.g. raw → windows, windows → features)
Loader → write output (Parquet, CSV, metadata, checkpoints)
- TPL/GENKEY: Extract
.tpl/.genkeypairs → parse and merge → write Parquet (+ optional metadata). - Windows: Extract Parquet time series → slice into fixed windows → write one Parquet per window (with “already processed” skip).
- Features: Extract window Parquet → wavelet + stats → accumulate in buffer, write Parquet and checkpoints.
- Parquet–CSV: Extract Parquet → optional filtering → write CSV (with overwrite/config-hash idempotency).
Progress monitoring (and Prefect logging) is pluggable via config (use_prefect_progress, prefect_logger).
Training and evaluation flow¶
- Config — Script loads section from
pipelines_config.yml(e.g.training_pfm_detection_pipeline). - Paths —
resolve_config_pathsresolvesinput_path,output_folder,features_schema_file(local or S3). - Idempotency — If
training_metadata.json(or equivalent) exists andconfig_hashmatches, exit without training. - Data — Load Parquet (single file or list from directory), align columns to
features_schema_file, split (by row or by case). - Train — LightGBM training with validation and optional robustness (e.g. quantization); save model and metrics.
- Post-processing — Detection pipelines may compute threshold metrics and write deployment bundles (inference config, schema).
Offline test pipelines follow a similar config → resolve → run evaluation → write reports pattern.
Scheduling and Prefect¶
- APScheduler:
etl_scheduler.pyloads a YAML withschedulerandpipelinesections; it starts anAsyncIOSchedulerand runs the configured pipeline (e.g. TPL/GENKEY) on interval, cron, or daily. One job at a time; graceful shutdown on SIGINT/SIGTERM. - Prefect: Flows (e.g.
genkey_pipeline_flow) wrap the same pipeline execution;prefect.yamland deployment scripts register schedules. Workers run flows; Prefect Server/UI provide observability. Progress can be reported viaPrefectProgressMonitor.
Summary¶
The platform is config-driven, idempotent, and storage-aware (local + S3). Entry points are scripts and Prefect flows; core logic lives in mlops (ETL, storage, training). Documentation (this site), central config, and consistent conventions keep the system understandable and ready for production scheduling.