Objectives & Use Cases

This page describes the goals of the MLOps Platform and the main use cases it supports.

Objectives

End-to-end reproducibility — From raw OLGA (or similar) data to trained models and evaluation reports, with a single config file and idempotent runs.
Production-ready scheduling — Run ETL and pipelines on a schedule (APScheduler or Prefect) with logging, observability, and graceful shutdown.
Storage flexibility — Use local disk or Amazon S3 without changing pipeline logic; paths are resolved via a small set of config keys.
Clear conventions — One central config, consistent naming, and documented pipelines so new users and teams can onboard quickly.
World-class documentation — Structured, English documentation (including this site) so the platform is easy to adopt and extend.

Use Cases

R&D (full pipeline)

Goal: Experiment with feature sets, models, and hyperparameters on fixed datasets.
How: Run TPL/GENKEY → Windows → Features → Feature selection (optional) → Training or Optuna. Use the same pipelines_config.yml; change only the relevant keys (e.g. features_schema_file, lgbm_params). Idempotency avoids re-running unchanged stages.
Outputs: Trained models, metrics, and (for Optuna) study results and optimized params.
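
The idempotency described above can be pictured as a completion marker per stage. The sketch below is an assumption about one way to get that behaviour, not the platform's actual mechanism: `config_fingerprint`, `run_stage_if_needed`, and the `.done` marker file are hypothetical names introduced for illustration.

```python
import hashlib
import json
from pathlib import Path


def config_fingerprint(stage_config: dict) -> str:
    """Hash the stage's config keys so any change invalidates the previous run."""
    canonical = json.dumps(stage_config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_stage_if_needed(name: str, stage_config: dict, output_dir: Path, run) -> bool:
    """Call run(output_dir) unless a marker shows the same config already completed.

    Returns True if the stage ran and False if it was skipped.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    marker = output_dir / f".{name}.done"  # hypothetical marker file name
    fingerprint = config_fingerprint(stage_config)
    if marker.exists() and marker.read_text(encoding="utf-8") == fingerprint:
        return False  # same config, stage already completed: skip it
    run(output_dir)
    marker.write_text(fingerprint, encoding="utf-8")  # written only after success
    return True
```

With this scheme, changing only features_schema_file changes the fingerprint of the stages that read it, so those stages re-run while stages with an unchanged config are skipped.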

Production ETL

Goal: Ingest new raw data periodically and produce Parquet (and optionally CSV) for downstream consumers.
How: Schedule the TPL/GENKEY pipeline (and optionally Windows/Features) via APScheduler (etl_scheduler.py) or Prefect. The config points to source_folder and output_folder. S3 paths are resolved, but the ETL pipelines currently require local I/O. Already-processed files are skipped.
Outputs: Parquet (and CSV if used) in output_folder; logs and run metadata.
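
Skipping already-processed files can be sketched as a comparison between the two folders. This is a minimal illustration under stated assumptions, not the platform's implementation: `pending_raw_files` is a hypothetical helper, the `*.tpl` pattern is an assumed raw-file extension, and the rule "a raw file is done when a Parquet file with the same stem exists" is a guess at the bookkeeping.

```python
from pathlib import Path


def pending_raw_files(
    source_folder: Path, output_folder: Path, pattern: str = "*.tpl"
) -> list[Path]:
    """Return raw files with no Parquet file of the same stem in output_folder."""
    done = set()
    if output_folder.is_dir():
        # Assumed convention: run_a.tpl is converted to run_a.parquet.
        done = {p.stem for p in output_folder.glob("*.parquet")}
    return sorted(p for p in source_folder.glob(pattern) if p.stem not in done)
```

A scheduled job would call this at the start of each run and convert only the returned files, which keeps repeated runs cheap when nothing new has arrived.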

Model training

Goal: Train detection/size/location/leak-flow models and produce artifacts for deployment.
How: Run the appropriate training script (e.g. run_training_pfm_detection_pipeline.py) with pipelines_config.yml. Use features_schema_file to restrict features. Outputs include model_lgbm.txt, metrics, and (for detection) deployment bundles (e.g. inference config, schema).
Outputs: Model files, training_metadata.json, metrics, and deployment bundles under output_folder.

Hyperparameter tuning

Goal: Find better hyperparameters for a given pipeline type (detection, multiclass, regression).
How: Run run_optimize_lgbm_hyperparameters.py with the same config section; optionally enable feature selection before tuning. Optuna runs the configured number of trials and saves the best parameters and the study.
Outputs: optimized_params.json, optuna_study.json, and (if enabled) a new feature schema.
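
One way to feed the tuned values back into training is to overlay them on the existing lgbm_params. The sketch below assumes optimized_params.json is a flat JSON object mapping LightGBM parameter names to values; that layout and the helper name `load_tuned_lgbm_params` are assumptions, so check the file your run actually produces.

```python
import json
from pathlib import Path


def load_tuned_lgbm_params(base_params: dict, optimized_params_path: Path) -> dict:
    """Overlay tuned values on the base lgbm_params without mutating the input."""
    tuned = json.loads(optimized_params_path.read_text(encoding="utf-8"))
    # Keys present in the tuned file win; untouched base keys are kept as-is.
    return {**base_params, **tuned}
```

Keeping the base parameters separate from the tuned overlay makes it easy to rerun training with or without the Optuna result.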

Offline evaluation

Goal: Evaluate trained models on held-out Parquet/CSV data and generate reports.
How: Run run_test_offline_pipeline.py or run_pfm_test_offline_pipeline.py with a config that points to parquet_root, the model paths, and output_report_folder. Reports include Excel files, confusion matrices, and diagnostic plots.
Outputs: Excel report, plots, and JSON results in output_report_folder.

S3 workflows

Goal: Run the same pipelines when data and outputs live in S3.
How: Set storage.type: s3, storage.bucket, and optionally storage.prefix in the pipeline section (or use explicit s3:// paths). Install s3fs through the s3 extra (pip install .[s3]). Feature selection, training, Optuna, and offline testing support S3; the ETL pipelines resolve S3 paths but currently require local I/O (S3 support in ETL is planned).
Outputs: The same outputs as the corresponding local run, written to S3.
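
The storage keys above suggest a small resolution step between the config and the pipeline code. The following is a sketch of how such a step might behave, assuming relative paths are joined onto the bucket and prefix; `resolve_path` is a hypothetical function and the platform's real resolver may differ.

```python
import posixpath


def resolve_path(storage: dict, path: str) -> str:
    """Resolve a configured path against the storage section of a pipeline config."""
    if path.startswith("s3://"):
        return path  # explicit S3 paths pass through unchanged
    if storage.get("type", "local") == "s3":
        prefix = storage.get("prefix", "").strip("/")
        key = posixpath.join(prefix, path.lstrip("/"))
        return f"s3://{storage['bucket']}/{key}"
    return path  # local storage: use the path as given


# Example with a made-up bucket and prefix.
storage = {"type": "s3", "bucket": "example-bucket", "prefix": "mlops/"}
print(resolve_path(storage, "models/model_lgbm.txt"))
# s3://example-bucket/mlops/models/model_lgbm.txt
```

Because the pipeline only ever sees the resolved string, switching between local disk and S3 is a config change rather than a code change.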

Prefect production

Goal: Run the TPL/GENKEY (or other) pipeline on a schedule in production with observability and retries.
How: Deploy Prefect flows using prefect.yaml and deploy_flows.py (or prefect_manage.sh). Configure a schedule (e.g. every 5 minutes, or daily at 4 PM). A worker picks up and runs each scheduled flow run; the Prefect UI shows runs and logs.
Outputs: Same as the underlying pipeline; execution history and logs in Prefect.

| Use case | Main scripts / components | Scheduling option |
| --- | --- | --- |
| R&D (full pipeline) | All `run_*_pipeline.py` | Manual / cron |
| Production ETL | TPL/GENKEY, Windows, Features | APScheduler or Prefect |
| Model training | Training scripts | Manual / CI |
| Hyperparameter tuning | `run_optimize_lgbm_hyperparameters.py` | Manual |
| Offline evaluation | Test offline scripts | Manual |
| S3 workflows | Same scripts + S3 config | Same as for the underlying scripts |
| Prefect production | Prefect flows + worker | Prefect schedules |