
Objectives & Use Cases

This page describes the goals of the MLOps Platform and the main use cases it supports.


Objectives

  1. End-to-end reproducibility — From raw OLGA (or similar) data to trained models and evaluation reports, with a single config file and idempotent runs.
  2. Production-ready scheduling — Run ETL and pipelines on a schedule (APScheduler or Prefect) with logging, observability, and graceful shutdown.
  3. Storage flexibility — Use local disk or Amazon S3 without changing pipeline logic; paths are resolved via a small set of config keys.
  4. Clear conventions — One central config, consistent naming, and documented pipelines so new users and teams can onboard quickly.
  5. World-class documentation — Structured, English documentation (including this site) so the platform is easy to adopt and extend.

Use Cases

1. Research & development

  • Goal: Experiment with feature sets, models, and hyperparameters on fixed datasets.
  • How: Run TPL/GENKEY → Windows → Features → Feature selection (optional) → Training or Optuna. Use the same pipelines_config.yml; change only the relevant keys (e.g. features_schema_file, lgbm_params). Idempotency avoids re-running unchanged stages.
  • Outputs: Trained models, metrics, and (for Optuna) study results and optimized params.
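
The idempotency mentioned above can be sketched as a per-stage fingerprint check. This is only an illustration of the idea, not the platform's actual mechanism: the marker file name `_fingerprint.txt` and the helper names are hypothetical, and the real pipelines may track completed stages differently.

```python
import hashlib
import json
from pathlib import Path

MARKER_NAME = "_fingerprint.txt"  # hypothetical marker file, not a platform convention


def stage_fingerprint(stage_name: str, params: dict) -> str:
    """Hash the stage name and its parameters into a stable fingerprint."""
    payload = json.dumps({"stage": stage_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_run(stage_dir: Path, fingerprint: str) -> bool:
    """Return True when the stage has no marker or the stored marker differs."""
    marker = stage_dir / MARKER_NAME
    return not marker.exists() or marker.read_text().strip() != fingerprint


def mark_done(stage_dir: Path, fingerprint: str) -> None:
    """Record the fingerprint once the stage has finished successfully."""
    stage_dir.mkdir(parents=True, exist_ok=True)
    (stage_dir / MARKER_NAME).write_text(fingerprint)
```

With a check like this, changing only features_schema_file or lgbm_params would invalidate the stages that consume those keys and leave the others untouched.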

2. Production ETL

  • Goal: Ingest new raw data periodically and produce Parquet (and optionally CSV) for downstream consumers.
  • How: Schedule the TPL/GENKEY pipeline (and optionally Windows/Features) via APScheduler (etl_scheduler.py) or Prefect. The config points to source_folder and output_folder (local or S3). Files that have already been processed are skipped on later runs.
  • Outputs: Parquet (and CSV if used) in output_folder; logs and run metadata.
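
One way to skip already-processed files is to compare the source folder with the output folder before each run. The sketch below assumes one Parquet file per source file with the same stem; that naming rule and the `pending_files` helper are assumptions for illustration, and the actual ETL may use a different bookkeeping scheme.

```python
from pathlib import Path


def pending_files(source_folder: Path, output_folder: Path) -> list[Path]:
    """List source files that have no matching .parquet output yet."""
    pending = []
    for src in sorted(source_folder.iterdir()):
        if not src.is_file():
            continue  # ignore sub-directories
        # Assumed naming rule: <stem>.parquet in output_folder
        target = output_folder / (src.stem + ".parquet")
        if not target.exists():
            pending.append(src)
    return pending
```

A scheduled job could call `pending_files(...)` at the start of each run and process only the returned paths.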

3. Model training and deployment

  • Goal: Train detection/size/location/leak-flow models and produce artifacts for deployment.
  • How: Run the appropriate training script (e.g. run_training_pfm_detection_pipeline.py) with pipelines_config.yml. Use features_schema_file to restrict the features used for training.
  • Outputs: Model files (e.g. model_lgbm.txt), training_metadata.json, metrics, and, for detection, deployment bundles (e.g. inference config, schema) under output_folder.

4. Hyperparameter tuning

  • Goal: Find better hyperparameters for a given pipeline type (detection, multiclass, regression).
  • How: Run run_optimize_lgbm_hyperparameters.py with the same config section; optionally enable feature selection before tuning. Optuna runs the configured number of trials and saves the best parameters and the study.
  • Outputs: optimized_params.json, optuna_study.json, and (if enabled) a new feature schema.
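
After tuning, the optimized parameters can be fed back into training. The sketch below assumes that optimized_params.json is a flat JSON object mapping parameter names to values; the file layout is not documented here, so treat both the layout and the `merge_optimized_params` helper as hypothetical.

```python
import json
from pathlib import Path


def merge_optimized_params(base_params: dict, optimized_path: Path) -> dict:
    """Return base_params overridden by tuned values, leaving the input unchanged."""
    tuned = json.loads(optimized_path.read_text(encoding="utf-8"))
    # Tuned values win over the defaults; keys missing from the file are kept as-is.
    return {**base_params, **tuned}
```

The merged dictionary could then replace lgbm_params in the relevant config section before the next training run.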

5. Offline evaluation

  • Goal: Evaluate trained models on held-out Parquet/CSV data and generate reports.
  • How: Run run_test_offline_pipeline.py or run_pfm_test_offline_pipeline.py with a config that points to parquet_root, the model paths, and output_report_folder. Reports include Excel, confusion matrices, and diagnostic plots.
  • Outputs: Excel report, plots, and JSON results in output_report_folder.

6. S3-backed workflows

  • Goal: Run the same pipelines when data and outputs live in S3.
  • How: Set storage.type: s3, storage.bucket, and optionally storage.prefix in the pipeline section (or use explicit s3:// paths). Install s3fs (pip install .[s3]). The feature selection, training, Optuna, and offline test scripts support S3; the ETL pipelines resolve S3 paths but currently require local I/O (S3 support in ETL is planned).
  • Outputs: Same as above, written to S3.
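
The path resolution described above can be sketched as a small helper that reads the storage keys. The key names type, bucket, and prefix come from this page; the `root` key for local storage and the `resolve_path` function itself are assumptions for illustration, not the platform's API.

```python
import posixpath
from pathlib import Path


def resolve_path(storage: dict, relative: str) -> str:
    """Resolve a path against storage settings; explicit s3:// paths pass through."""
    if relative.startswith("s3://"):
        return relative
    if storage.get("type", "local") == "s3":
        prefix = storage.get("prefix", "").strip("/")
        key = relative.lstrip("/")
        if prefix:
            key = posixpath.join(prefix, key)
        return f"s3://{storage['bucket']}/{key}"
    # Local storage: 'root' is a hypothetical key, defaulting to the working directory.
    return str(Path(storage.get("root", ".")) / relative)
```

Because the pipeline only sees the resolved string, switching between local disk and S3 would be a config change rather than a code change.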

7. Scheduled production runs with Prefect

  • Goal: Run the TPL/GENKEY (or other) pipeline on a schedule in production with observability and retries.
  • How: Deploy Prefect flows using prefect.yaml and deploy_flows.py (or prefect_manage.sh). Configure schedule (e.g. every 5 minutes, daily at 4 PM). Worker pulls and runs the flow; Prefect UI shows runs and logs.
  • Outputs: Same as the underlying pipeline; execution history and logs in Prefect.

Summary

| Use case | Main scripts / components | Scheduling option |
|---|---|---|
| R&D (full pipeline) | All run_*_pipeline.py | Manual / cron |
| Production ETL | TPL/GENKEY, Windows, Features | APScheduler or Prefect |
| Model training | Training scripts | Manual / CI |
| Hyperparameter tuning | run_optimize_lgbm_hyperparameters.py | Manual |
| Offline evaluation | Test offline scripts | Manual |
| S3 workflows | Same scripts + S3 config | Same as above |
| Prefect production | Prefect flows + worker | Prefect schedules |