Development Conventions

This document describes the coding and configuration conventions used across the MLOps Platform so you can contribute and extend it consistently.

Project Layout

configs/: Central YAML/JSON configs. The main entry point is pipelines_config.yml, which covers all pipelines; ETL scheduler configs also live here.
mlops/: Core package, containing etl/ (pipeline types, extractors, loaders, transformers), storage/ (local + S3 I/O), training/ (LightGBM trainer, deploy bundle), and prefect/ (flows, deployments).
scripts/: Entry points, including run_*_pipeline.py for each pipeline, etl_scheduler.py, deploy_flows.py, and helper scripts.
src/: Optional domain/application layer (entities, use cases); may be absent.
docs/: Documentation source for this site (MkDocs).
tests/: Pytest tests.

Config Conventions

Central config

Pipeline settings live in configs/pipelines_config.yml.
Each pipeline has its own top-level key (e.g. tpl_genkey_pipeline, features_pipeline, training_pfm_detection_pipeline).
Scripts load the file and select the section by name (or via pipeline key for scheduler compatibility).
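
As a rough illustration of loading the central config and selecting a section, the sketch below may help. The helper names load_config_file and select_pipeline_section are hypothetical and are not taken from the mlops package, YAML parsing assumes PyYAML is installed, and the scheduler's pipeline-key lookup is not covered.

```python
import json
from pathlib import Path
from typing import Any


def load_config_file(path: str) -> dict[str, Any]:
    """Load a YAML or JSON config file into a dict (hypothetical helper)."""
    text = Path(path).read_text(encoding="utf-8")
    if path.endswith((".yml", ".yaml")):
        # PyYAML is imported lazily so JSON-only use needs no extra dependency.
        import yaml

        return yaml.safe_load(text) or {}
    return json.loads(text)


def select_pipeline_section(config: dict[str, Any], name: str) -> dict[str, Any]:
    """Return the top-level section for one pipeline, failing loudly if absent."""
    try:
        return config[name]
    except KeyError:
        available = ", ".join(sorted(config)) or "(none)"
        raise KeyError(
            f"No section {name!r} in config; available: {available}"
        ) from None
```

A script for the features pipeline might then call select_pipeline_section(load_config_file("configs/pipelines_config.yml"), "features_pipeline").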

Paths and storage

Local: Use relative or absolute paths; relative paths are resolved from the project root when running from the repository.
S3: Either use full s3://bucket/key URIs, or set storage.type: s3, storage.bucket, and optionally storage.prefix; source_folder, output_folder, input_path, and similar keys are then resolved under that bucket and prefix.
Each pipeline script calls mlops.storage.resolve_config_paths(config) at startup to perform this resolution.
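
To make the S3 rule concrete, here is a minimal sketch of how path keys could be resolved under a bucket and prefix. It is not the implementation of mlops.storage.resolve_config_paths: the function name resolve_paths_sketch and the PATH_KEYS tuple are assumptions for illustration, and the real resolver may handle more keys and edge cases.

```python
from typing import Any

# Assumed for illustration; the keys the real resolver handles may differ.
PATH_KEYS = ("source_folder", "output_folder", "input_path")


def resolve_paths_sketch(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of config with path keys expanded to s3:// URIs if needed."""
    storage = config.get("storage") or {}
    if storage.get("type") != "s3":
        return dict(config)  # local paths are left untouched in this sketch

    bucket = storage["bucket"]
    prefix = str(storage.get("prefix") or "").strip("/")
    resolved = dict(config)
    for key in PATH_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or value.startswith("s3://"):
            continue  # missing, non-string, or already a full S3 URI
        parts = [part for part in (prefix, value.strip("/")) if part]
        resolved[key] = f"s3://{bucket}/" + "/".join(parts)
    return resolved
```

With storage.bucket set to my-bucket and storage.prefix set to mlops, a source_folder of raw/tpl would become s3://my-bucket/mlops/raw/tpl under this sketch, while a value that is already an s3:// URI is passed through unchanged.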

Idempotency

Pipelines use a config hash (relevant keys only) and metadata files (e.g. training_metadata.json, offline_run_metadata.json) to skip work when config and inputs are unchanged.
ETL pipelines (TPL/GENKEY, Windows) skip input files whose output already exists; the Parquet–CSV pipeline uses overwrite: false together with the config hash.
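
The config-hash check can be sketched roughly as follows. The function names, the choice of SHA-256 over canonical JSON, and the config_hash metadata field are all assumptions for illustration; the platform's own helpers may differ, and detecting changed inputs is left out here.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Iterable


def config_hash(config: dict[str, Any], relevant_keys: Iterable[str]) -> str:
    """Hash only the keys that affect the output, in a stable order."""
    subset = {key: config.get(key) for key in sorted(relevant_keys)}
    payload = json.dumps(subset, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def already_done(metadata_path: Path, current_hash: str) -> bool:
    """True if a previous run recorded the same config hash."""
    if not metadata_path.exists():
        return False
    try:
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable metadata: treat as not run
    return isinstance(metadata, dict) and metadata.get("config_hash") == current_hash


def write_metadata(metadata_path: Path, current_hash: str) -> None:
    """Record the hash so the next run can compare against it."""
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(
        json.dumps({"config_hash": current_hash}, indent=2), encoding="utf-8"
    )
```

Restricting the hash to relevant keys means that changing something like a log level does not invalidate earlier results, whereas changing an input path or a model parameter does.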

Code Style

Formatter: Black (line length 88).
Imports: isort with Black-compatible profile.
Linting: flake8; optional mypy for type checking.
Docstrings: Prefer clear one-line or short multi-line descriptions; document parameters and returns where it helps.

Pipeline Script Pattern

Each run_*_pipeline.py script typically:

Parses CLI arguments (e.g. --config and overrides).
Loads config from YAML/JSON and selects the right section.
Calls resolve_config_paths(config) for S3/local path resolution.
Checks idempotency (metadata + config hash); exits early if already run.
Runs the pipeline (async or sync).
Writes outputs and metadata (including hash for next idempotency check).
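
The steps above can be sketched as a single script skeleton. This is an illustrative outline, not a copy of any existing run_*_pipeline.py: the section name example_pipeline, the --output-folder override, and the run_metadata.json file name are hypothetical. The local resolve_config_paths is a stand-in for the function in mlops.storage, the config is read as JSON to keep the sketch dependency-free, and output_folder is assumed to be a local path.

```python
import argparse
import asyncio
import hashlib
import json
from pathlib import Path
from typing import Any, Optional

SECTION = "example_pipeline"  # hypothetical section name


def resolve_config_paths(config: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for mlops.storage.resolve_config_paths; returns config unchanged."""
    return config


async def main_async(config: dict[str, Any]) -> dict[str, Any]:
    """Placeholder pipeline body; a real pipeline would extract, transform, load."""
    await asyncio.sleep(0)
    return {"status": "ok"}


def main(argv: Optional[list[str]] = None) -> int:
    # Step 1: parse the CLI.
    parser = argparse.ArgumentParser(description="Run the example pipeline (sketch).")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    parser.add_argument("--output-folder", help="Override output_folder from config")
    args = parser.parse_args(argv)

    # Step 2: load the config and select this pipeline's section.
    config = json.loads(Path(args.config).read_text(encoding="utf-8"))[SECTION]
    if args.output_folder:
        config["output_folder"] = args.output_folder

    # Step 3: resolve S3/local paths.
    config = resolve_config_paths(config)

    # Step 4: idempotency check against the stored config hash.
    payload = json.dumps(config, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    metadata_path = Path(config["output_folder"]) / "run_metadata.json"
    if metadata_path.exists():
        previous = json.loads(metadata_path.read_text(encoding="utf-8"))
        if previous.get("config_hash") == digest:
            print("Config unchanged; skipping run.")
            return 0

    # Step 5: run the pipeline.
    result = asyncio.run(main_async(config))

    # Step 6: write metadata, including the hash for the next check.
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    metadata_path.write_text(
        json.dumps({"config_hash": digest, **result}, indent=2), encoding="utf-8"
    )
    return 0
```

A real script would typically end with an `if __name__ == "__main__":` guard that calls `sys.exit(main())`; it is omitted here so the sketch can be imported and exercised with an explicit argument list such as main(["--config", "cfg.json"]).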

Naming

Config keys: snake_case (e.g. source_folder, output_folder, features_schema_file).
Pipeline sections: snake_case with _pipeline suffix (e.g. tpl_genkey_pipeline, feature_selection_pipeline).
Scripts: run_<name>_pipeline.py or run_<name>.py (e.g. run_tpl_genkey_pipeline.py, run_optimize_lgbm_hyperparameters.py).

Async Usage

ETL pipelines and the ETL scheduler use asyncio.
Entry points use asyncio.run(main_async(config)) or equivalent.
Prefect flows are defined with @flow and may call async pipeline code via run_in_executor or async entry points.
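
A minimal sketch of the two calling styles, using only the standard library, might look like this. The names process_file, run_pipeline, and call_from_running_loop and the "files" config key are hypothetical, and Prefect itself is left out; in a flow, the synchronous wrapper would be what gets offloaded to an executor.

```python
import asyncio
from typing import Any


async def process_file(name: str) -> str:
    """Placeholder for awaitable I/O on one file."""
    await asyncio.sleep(0)
    return f"{name}.out"


async def main_async(config: dict[str, Any]) -> list[str]:
    """Process all files concurrently; results keep the input order."""
    return await asyncio.gather(*(process_file(name) for name in config["files"]))


def run_pipeline(config: dict[str, Any]) -> list[str]:
    """Synchronous entry point, as used by a script's main()."""
    return asyncio.run(main_async(config))


async def call_from_running_loop(config: dict[str, Any]) -> list[str]:
    """Call the sync entry point from code that already runs inside an event loop.

    asyncio.run cannot be nested in a running loop, so the call is moved to a
    worker thread, where it creates its own loop.
    """
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, run_pipeline, config)
```

The executor route is mainly useful when the caller is already async and the pipeline exposes only a synchronous wrapper; when main_async is importable, awaiting it directly is simpler.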

Testing

Framework: pytest; async tests use pytest-asyncio.
Place tests in tests/; mirror the package layout where useful.
Run: pytest, or pytest -v --cov=mlops for verbose output with coverage (the --cov option requires the pytest-cov plugin).

Pre-commit (optional)

If pre-commit is installed (pip install .[dev] and pre-commit install), hooks can run Black, isort, and flake8 on commit to keep the codebase consistent.