Development Conventions

This document describes the coding and configuration conventions used across the MLOps Platform so you can contribute and extend it consistently.


Project Layout

  • configs/ — Central YAML/JSON configs; main entry is pipelines_config.yml for all pipelines. ETL scheduler configs live here too.
  • mlops/ — Core package: etl/ (pipeline types, extractors, loaders, transformers), storage/ (local + S3 I/O), training/ (LightGBM trainer, deploy bundle), prefect/ (flows, deployments).
  • scripts/ — Entry points: run_*_pipeline.py for each pipeline, etl_scheduler.py, deploy_flows.py, and helper scripts.
  • src/ — Optional domain/application layer (entities, use cases) if present.
  • docs/ — Documentation source (MkDocs); this site.
  • tests/ — Pytest tests.

Config Conventions

Central config

  • Pipeline settings live in configs/pipelines_config.yml.
  • Each pipeline has its own top-level key (e.g. tpl_genkey_pipeline, features_pipeline, training_pfm_detection_pipeline).
  • Scripts load the file and select the section by name (or via pipeline key for scheduler compatibility).
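
As a minimal sketch of the section lookup described above: the helper name select_pipeline_section and the sample values are hypothetical, not part of the platform's API. The example parses JSON with the standard library to stay self-contained; for the YAML file, the parsing step would typically be PyYAML's yaml.safe_load instead.

```python
import json

# Hypothetical sample mirroring the layout of configs/pipelines_config.yml:
# one top-level key per pipeline. The values are illustrative only.
SAMPLE_CONFIG = """
{
  "tpl_genkey_pipeline": {"source_folder": "data/raw", "output_folder": "data/tpl"},
  "features_pipeline": {"input_path": "data/tpl", "output_folder": "data/features"}
}
"""


def select_pipeline_section(all_config: dict, name: str) -> dict:
    """Return the config section for one pipeline, failing loudly if absent."""
    try:
        return all_config[name]
    except KeyError:
        known = ", ".join(sorted(all_config))
        raise KeyError(f"Unknown section {name!r}; available: {known}") from None


all_config = json.loads(SAMPLE_CONFIG)
features_config = select_pipeline_section(all_config, "features_pipeline")
print(features_config["output_folder"])
```

Failing with the list of known sections makes a mistyped pipeline name easy to spot from the command line.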

Paths and storage

  • Local: Use relative or absolute paths; relative paths are resolved from the project root when running from the repo.
  • S3: Use s3://bucket/key URIs, or set storage.type: s3, storage.bucket, and optionally storage.prefix; source_folder, output_folder, input_path, etc. are then resolved under that bucket/prefix.
  • Each pipeline script resolves paths by calling mlops.storage.resolve_config_paths(config) at startup.
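
The resolution rules above can be illustrated roughly as follows. This is not the implementation of mlops.storage.resolve_config_paths; the function resolve_paths_sketch, the fixed list of path keys, and the bucket names are assumptions made for the example.

```python
from pathlib import Path

# Keys treated as paths in this sketch; the real resolver may handle more.
PATH_KEYS = ("source_folder", "output_folder", "input_path")


def resolve_paths_sketch(config: dict, project_root: Path) -> dict:
    """Illustrative only: return a copy of config with path keys resolved."""
    resolved = dict(config)
    storage = config.get("storage", {})
    for key in PATH_KEYS:
        value = config.get(key)
        if value is None or value.startswith("s3://"):
            # Missing keys are skipped; explicit S3 URIs are kept unchanged.
            continue
        if storage.get("type") == "s3":
            # Join bucket, optional prefix, and the configured value.
            parts = [storage["bucket"], storage.get("prefix", ""), value]
            joined = "/".join(p.strip("/") for p in parts if p)
            resolved[key] = f"s3://{joined}"
        else:
            # Relative local paths are anchored at the project root.
            path = Path(value)
            resolved[key] = str(path if path.is_absolute() else project_root / path)
    return resolved


s3_config = {
    "storage": {"type": "s3", "bucket": "my-bucket", "prefix": "mlops"},
    "source_folder": "raw/tpl",
    "output_folder": "s3://other-bucket/out",
}
print(resolve_paths_sketch(s3_config, Path("/repo")))
```

Here source_folder becomes s3://my-bucket/mlops/raw/tpl, while the explicit s3:// value of output_folder is left as written.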

Idempotency

  • Pipelines use a config hash (relevant keys only) and metadata files (e.g. training_metadata.json, offline_run_metadata.json) to skip work when config and inputs are unchanged.
  • ETL pipelines (TPL/GENKEY, Windows) skip files that already have output; the Parquet–CSV pipeline uses overwrite: false and a config hash.
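
A minimal sketch of the config-hash check, assuming the hash is a SHA-256 over the relevant keys and that the metadata file stores it under a config_hash field. Both are assumptions for illustration, as are the helper names; the sketch covers only the config side, not the comparison of inputs.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(config: dict, relevant_keys: list[str]) -> str:
    """Hash only the keys that affect the output, in a stable order."""
    subset = {key: config.get(key) for key in sorted(relevant_keys)}
    payload = json.dumps(subset, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def already_done(metadata_path: Path, current_hash: str) -> bool:
    """True if a previous run recorded the same config hash."""
    if not metadata_path.exists():
        return False
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return metadata.get("config_hash") == current_hash


def write_metadata(metadata_path: Path, current_hash: str) -> None:
    """Record the hash so the next run can compare against it."""
    payload = json.dumps({"config_hash": current_hash})
    metadata_path.write_text(payload, encoding="utf-8")


# Usage: the second check finds the recorded hash and reports "done".
config = {"input_path": "data/features", "num_leaves": 31, "log_level": "debug"}
digest = config_hash(config, ["input_path", "num_leaves"])
with tempfile.TemporaryDirectory() as tmp:
    metadata_file = Path(tmp) / "training_metadata.json"
    before = already_done(metadata_file, digest)
    write_metadata(metadata_file, digest)
    after = already_done(metadata_file, digest)
print(before, after)
```

Restricting the hash to relevant keys means that changing something like a log level does not invalidate earlier results.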

Code Style

  • Formatter: Black (line length 88).
  • Imports: isort with Black-compatible profile.
  • Linting: flake8; optional mypy for type checking.
  • Docstrings: Prefer clear one-line or short multi-line descriptions; document parameters and returns where it helps.

Pipeline Script Pattern

Each run_*_pipeline.py script typically:

  1. Parses CLI arguments (e.g. --config and overrides).
  2. Loads config from YAML/JSON and selects the right section.
  3. Calls resolve_config_paths(config) for S3/local path resolution.
  4. Checks idempotency (metadata + config hash); exits early if the run is already up to date.
  5. Runs the pipeline (async or sync).
  6. Writes outputs and metadata (including hash for next idempotency check).
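
The six steps above might be wired together roughly as follows. This is a sketch under stated assumptions, not one of the actual run_*_pipeline.py scripts: the --section and --output-folder flags, the run_metadata.json file name, and the run_pipeline placeholder are hypothetical, and JSON is used so that the example needs only the standard library.

```python
import argparse
import asyncio
import hashlib
import json
import tempfile
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run one pipeline (sketch).")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    # Hypothetical flags, shown only to illustrate section choice and overrides.
    parser.add_argument("--section", default="example_pipeline")
    parser.add_argument("--output-folder", default=None)
    return parser.parse_args(argv)


async def run_pipeline(config: dict) -> dict:
    """Placeholder for the real pipeline work."""
    await asyncio.sleep(0)
    return {"status": "ok"}


def main(argv=None) -> str:
    args = parse_args(argv)  # 1. parse CLI
    all_config = json.loads(Path(args.config).read_text(encoding="utf-8"))
    config = all_config[args.section]  # 2. select the section
    if args.output_folder:
        config["output_folder"] = args.output_folder
    # 3. The real scripts call resolve_config_paths(config) here; omitted in this sketch.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    metadata_path = Path(config["output_folder"]) / "run_metadata.json"
    if metadata_path.exists():  # 4. idempotency check
        previous = json.loads(metadata_path.read_text(encoding="utf-8"))
        if previous.get("config_hash") == digest:
            return "skipped"
    result = asyncio.run(run_pipeline(config))  # 5. run the pipeline
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata = {"config_hash": digest, **result}
    metadata_path.write_text(json.dumps(metadata), encoding="utf-8")  # 6. write metadata
    return "completed"


# Usage: the second call sees the metadata written by the first and skips.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "pipelines_config.json"
    sample = {"example_pipeline": {"output_folder": str(Path(tmp) / "out")}}
    config_file.write_text(json.dumps(sample), encoding="utf-8")
    cli_args = ["--config", str(config_file)]
    first = main(cli_args)
    second = main(cli_args)
print(first, second)
```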

Naming

  • Config keys: snake_case (e.g. source_folder, output_folder, features_schema_file).
  • Pipeline sections: snake_case with _pipeline suffix (e.g. tpl_genkey_pipeline, feature_selection_pipeline).
  • Scripts: run_<name>_pipeline.py or run_<name>.py (e.g. run_tpl_genkey_pipeline.py, run_optimize_lgbm_hyperparameters.py).
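
The naming rules above could be checked mechanically with a few regular expressions. The patterns and the check_names helper below are one possible reading of the conventions, offered as a sketch rather than an existing project tool.

```python
import re

# snake_case: lowercase words separated by single underscores.
KEY_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Pipeline sections: snake_case ending in _pipeline.
SECTION_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*_pipeline$")
# Scripts: run_<name>_pipeline.py or run_<name>.py.
SCRIPT_RE = re.compile(r"^run_[a-z0-9]+(_[a-z0-9]+)*\.py$")


def check_names(names: list[str], pattern: re.Pattern) -> list[str]:
    """Return the names that do not match the given convention."""
    return [name for name in names if not pattern.match(name)]


bad_sections = check_names(
    ["tpl_genkey_pipeline", "feature_selection_pipeline", "FeaturesPipeline"],
    SECTION_RE,
)
bad_scripts = check_names(
    ["run_tpl_genkey_pipeline.py", "run_optimize_lgbm_hyperparameters.py", "train.py"],
    SCRIPT_RE,
)
print(bad_sections, bad_scripts)
```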

Async Usage

  • ETL pipelines and the ETL scheduler use asyncio.
  • Entry points use asyncio.run(main_async(config)) or equivalent.
  • Prefect flows are defined with @flow and may call async pipeline code via run_in_executor or async entry points.
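
A small sketch of the two calling styles mentioned above, using only asyncio. The process_file coroutine and the files config key are hypothetical stand-ins, and the Prefect @flow wrapper is left out because the source does not show how the flows invoke the pipelines in detail.

```python
import asyncio


async def process_file(name: str) -> str:
    """Stand-in for an async extract/transform/load step."""
    await asyncio.sleep(0)  # yields control, as real I/O would
    return f"{name}.done"


async def main_async(config: dict) -> list[str]:
    # Process all configured files concurrently; results keep input order.
    return await asyncio.gather(*(process_file(f) for f in config["files"]))


def run_sync(config: dict) -> list[str]:
    """Synchronous entry point, as used at the bottom of a script."""
    return asyncio.run(main_async(config))


async def call_from_running_loop(config: dict) -> list[str]:
    """Run the sync entry point in a worker thread from async code.

    asyncio.run cannot be called while a loop is running in the same thread,
    so the call is handed to the default executor instead.
    """
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, run_sync, config)


demo_config = {"files": ["a.tpl", "b.tpl"]}
direct = run_sync(demo_config)
via_executor = asyncio.run(call_from_running_loop(demo_config))
print(direct, via_executor)
```

When the caller is already async, awaiting main_async(config) directly is simpler; the executor route is mainly useful when only a synchronous entry point is available.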

Testing

  • pytest; async tests use pytest-asyncio.
  • Place tests in tests/; mirror package layout where useful.
  • Run: pytest (optionally pytest -v --cov=mlops for a coverage report, which requires the pytest-cov plugin).

Pre-commit (optional)

If pre-commit is set up (pip install .[dev], then pre-commit install), its hooks run Black, isort, and flake8 on each commit to keep the codebase consistent.