Development Conventions
This document describes the coding and configuration conventions used across the MLOps Platform so you can contribute and extend it consistently.
Project Layout
- configs/: Central YAML/JSON configs; the main entry is pipelines_config.yml for all pipelines. ETL scheduler configs live here too.
- mlops/: Core package, containing etl/ (pipeline types, extractors, loaders, transformers), storage/ (local + S3 I/O), training/ (LightGBM trainer, deploy bundle), and prefect/ (flows, deployments).
- scripts/: Entry points, including run_*_pipeline.py for each pipeline, etl_scheduler.py, deploy_flows.py, and helper scripts.
- src/: Optional domain/application layer (entities, use cases), if present.
- docs/: Documentation source (MkDocs); this site.
- tests/: Pytest tests.
Config Conventions
Central config
- Pipeline settings live in configs/pipelines_config.yml.
- Each pipeline has its own top-level key (e.g. tpl_genkey_pipeline, features_pipeline, training_pfm_detection_pipeline).
- Scripts load the file and select the section by name (or via the pipeline key, for scheduler compatibility).
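As a rough illustration of section selection, the sketch below works on an already-parsed config mapping (for example, the result of parsing configs/pipelines_config.yml). The helper name select_pipeline_section and the fallback that matches on a pipeline key inside a section are assumptions made for this example; the actual scripts may do this differently.

```python
from typing import Any, Mapping


def select_pipeline_section(all_config: Mapping[str, Any], name: str) -> dict:
    """Return the config section for one pipeline (hypothetical helper)."""
    # Preferred form: the pipeline name is a top-level key.
    if name in all_config:
        return dict(all_config[name])
    # Assumed fallback: a section that names its pipeline via a "pipeline" key.
    for section in all_config.values():
        if isinstance(section, Mapping) and section.get("pipeline") == name:
            return dict(section)
    raise KeyError(f"no section for pipeline {name!r}")


# Stand-in for the parsed contents of the central config file.
example = {
    "tpl_genkey_pipeline": {"source_folder": "data/raw"},
    "scheduled_job": {"pipeline": "features_pipeline", "output_folder": "data/features"},
}
section = select_pipeline_section(example, "features_pipeline")
```

Raising KeyError for an unknown name keeps a misspelled section from silently falling back to defaults.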
Paths and storage
- Local: Use relative or absolute paths; relative paths are resolved from the project root when running from the repo.
- S3: Use s3://bucket/key, or set storage.type: s3, storage.bucket, and optional storage.prefix; then source_folder, output_folder, input_path, etc. are resolved under that bucket/prefix.
- Path resolution is done via mlops.storage.resolve_config_paths(config) at the start of each pipeline script.
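The resolution rules above could look roughly like the following. This is an illustrative stand-in, not the implementation of mlops.storage.resolve_config_paths: the function name resolve_paths_sketch, the fixed list of path keys, and the exact joining behaviour are all assumptions.

```python
from pathlib import Path

# Keys treated as paths in this sketch; the real set of keys is an assumption.
PATH_KEYS = ("source_folder", "output_folder", "input_path")


def resolve_paths_sketch(config: dict, project_root: Path) -> dict:
    """Illustrative stand-in, not mlops.storage.resolve_config_paths."""
    resolved = dict(config)
    storage = config.get("storage") or {}
    for key in PATH_KEYS:
        value = config.get(key)
        if not isinstance(value, str) or value.startswith("s3://"):
            continue  # missing, or already a full S3 URI
        if storage.get("type") == "s3":
            # Join bucket, optional prefix, and the configured path.
            parts = [storage["bucket"], storage.get("prefix", ""), value]
            cleaned = [p.strip("/") for p in parts]
            resolved[key] = "s3://" + "/".join(c for c in cleaned if c)
        elif not Path(value).is_absolute():
            # Local relative paths are anchored at the project root.
            resolved[key] = str(project_root / value)
    return resolved
```

The function returns a new dict rather than mutating its input, so the original config stays available for hashing or logging.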
Idempotency
- Pipelines use a config hash (relevant keys only) and metadata files (e.g. training_metadata.json, offline_run_metadata.json) to skip work when config and inputs are unchanged.
- ETL pipelines (TPL/GENKEY, Windows) skip files that already have output; Parquet–CSV uses overwrite: false and config hash.
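A minimal sketch of a config-hash check follows, assuming SHA-256 over a JSON dump of the relevant keys and a metadata field named config_hash. The function names, the field name, and the choice of hash are hypothetical. The sketch covers only the config side; detecting changed inputs would need an additional fingerprint.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config: dict, relevant_keys: list) -> str:
    """Hash only the keys that affect the output (key choice is an assumption)."""
    subset = {k: config.get(k) for k in sorted(relevant_keys)}
    payload = json.dumps(subset, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def already_done(metadata_path: Path, current_hash: str) -> bool:
    """True if a previous run recorded the same hash (field name is hypothetical)."""
    if not metadata_path.exists():
        return False
    try:
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable metadata: treat as not run
    return isinstance(metadata, dict) and metadata.get("config_hash") == current_hash
```

Restricting the hash to relevant keys means that changing something like a log level does not force a rerun.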
Code Style
- Formatter: Black (line length 88).
- Imports: isort with Black-compatible profile.
- Linting: flake8; optional mypy for type checking.
- Docstrings: Prefer clear one-line or short multi-line descriptions; document parameters and returns where it helps.
Pipeline Script Pattern
Each run_*_pipeline.py script typically:
- Parses CLI arguments (e.g. --config, overrides).
- Loads config from YAML/JSON and selects the right section.
- Calls resolve_config_paths(config) for S3/local path resolution.
- Checks idempotency (metadata + config hash); exits early if already run.
- Runs the pipeline (async or sync).
- Writes outputs and metadata (including hash for next idempotency check).
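The steps above can be sketched as a single entry point. This is a simplified skeleton under stated assumptions, not one of the real scripts: the section name example_pipeline, the file name run_metadata.json, the config_hash field, and run_pipeline are all hypothetical, and the config is read as JSON so the sketch needs only the standard library.

```python
import argparse
import hashlib
import json
from pathlib import Path

SECTION = "example_pipeline"  # hypothetical section name


def run_pipeline(config: dict) -> None:
    """Placeholder for the real pipeline work."""
    Path(config["output_folder"]).mkdir(parents=True, exist_ok=True)


def main(argv=None) -> str:
    # 1. Parse CLI arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    args = parser.parse_args(argv)

    # 2. Load the config and select the pipeline's section.
    all_config = json.loads(Path(args.config).read_text(encoding="utf-8"))
    config = all_config[SECTION]

    # 3. A real script would call resolve_config_paths(config) here.

    # 4. Idempotency check: compare the stored hash with the current one.
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    metadata_file = Path(config["output_folder"]) / "run_metadata.json"
    if metadata_file.exists():
        previous = json.loads(metadata_file.read_text(encoding="utf-8"))
        if previous.get("config_hash") == digest:
            return "skipped"

    # 5. Run the pipeline.
    run_pipeline(config)

    # 6. Write metadata, including the hash for the next check.
    metadata_file.write_text(json.dumps({"config_hash": digest}), encoding="utf-8")
    return "ran"
```

Accepting argv as a parameter lets tests call main(["--config", path]) directly; a real script would typically invoke main() from an if __name__ == "__main__" guard.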
Naming
- Config keys: snake_case (e.g. source_folder, output_folder, features_schema_file).
- Pipeline sections: snake_case with _pipeline suffix (e.g. tpl_genkey_pipeline, feature_selection_pipeline).
- Scripts: run_<name>_pipeline.py or run_<name>.py (e.g. run_tpl_genkey_pipeline.py, run_optimize_lgbm_hyperparameters.py).
Async Usage
- ETL pipelines and the ETL scheduler use asyncio.
- Entry points use asyncio.run(main_async(config)) or equivalent.
- Prefect flows are defined with @flow and may call async pipeline code via run_in_executor or async entry points.
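The entry-point pattern can be illustrated with plain asyncio. In this sketch, blocking_transform and the files config key are hypothetical, and run_in_executor is shown offloading blocking work from async code; how Prefect flows wire into this in the codebase is not covered here.

```python
import asyncio
import time


def blocking_transform(name: str) -> str:
    """Stand-in for synchronous, blocking work (hypothetical)."""
    time.sleep(0.01)
    return name.upper()


async def main_async(config: dict) -> list:
    loop = asyncio.get_running_loop()
    # Offload each blocking call to the default thread pool so the event
    # loop stays responsive; gather returns results in input order.
    tasks = [
        loop.run_in_executor(None, blocking_transform, name)
        for name in config["files"]
    ]
    return await asyncio.gather(*tasks)


# Synchronous entry point: start an event loop and run to completion.
results = asyncio.run(main_async({"files": ["a.csv", "b.csv"]}))
```

Keeping all async logic inside main_async leaves one asyncio.run call at the edge of the program, which is the shape the entry points above describe.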
Testing
- pytest; async tests use pytest-asyncio.
- Place tests in tests/; mirror package layout where useful.
- Run: pytest (optionally pytest -v --cov=mlops).
Pre-commit (optional)
If pre-commit is installed (pip install .[dev] and pre-commit install), hooks can run Black, isort, and flake8 on commit to keep the codebase consistent.