
Quick Start

This guide gets you from zero to running your first pipeline in a few steps.


Prerequisites

  • Python 3.10+
  • pip (or uv/poetry; adjust commands as needed)

1. Clone and install

# Go to the parent directory that should hold the clone
cd /path/to/repo
git clone <your-mlops-repo-url> mlops-platform
cd mlops-platform

Install the package with ETL and ML dependencies (and optionally S3 and docs):

pip install -e ".[etl,ml]"
# Optional: S3 and docs
# pip install -e ".[etl,ml,s3,docs]"

2. Config file

Pipelines read from a central config file. The repo includes configs/pipelines_config.yml with sections for every pipeline.

  • Source/output paths: Edit source_folder, output_folder, or input_path to point to your data and desired outputs.
  • S3 (optional): To use S3, install the .[s3] extra, then set storage.type: s3 and storage.bucket (and optionally storage.prefix) in the relevant pipeline section, or use explicit s3:// paths where supported.

Example (local):

tpl_genkey_pipeline:
  source_folder: "data/raw/my_run"
  output_folder: "data/processed/my_run"
  # ... rest of section
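
To make the local-versus-S3 choice concrete, here is a minimal sketch of how a section's storage settings could be turned into a full location. The helper name resolve_location and the rule for joining bucket, prefix, and path are assumptions made for illustration; the repository's own path resolution may differ.

```python
from pathlib import PurePosixPath


def resolve_location(section: dict, key: str) -> str:
    """Return a local path or an s3:// URI for section[key] (hypothetical helper)."""
    value = section[key]
    # Explicit s3:// paths are passed through unchanged.
    if value.startswith("s3://"):
        return value
    storage = section.get("storage", {})
    if storage.get("type") == "s3":
        # Assumed layout: s3://<bucket>/<prefix>/<path>; the prefix is optional.
        object_key = PurePosixPath(storage.get("prefix", "")) / value
        return f"s3://{storage['bucket']}/{object_key}"
    # No S3 storage configured: treat the value as a local path.
    return value


local_section = {"source_folder": "data/raw/my_run"}
s3_section = {
    "source_folder": "data/raw/my_run",
    "storage": {"type": "s3", "bucket": "my-bucket", "prefix": "mlops"},
}

local_result = resolve_location(local_section, "source_folder")
s3_result = resolve_location(s3_section, "source_folder")
print(local_result)
print(s3_result)
```

The bucket name my-bucket and the prefix mlops are placeholders, not values from the repository.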

3. Run a pipeline

From the project root (directory containing configs/ and scripts/):

TPL/GENKEY (OLGA to Parquet):

python scripts/run_tpl_genkey_pipeline.py --config configs/pipelines_config.yml

Windows (slice time series into fixed windows):

python scripts/run_windows_pipeline.py --config configs/pipelines_config.yml

Features (wavelet features from windows):

python scripts/run_features_pipeline.py --config configs/pipelines_config.yml

Training (e.g. PFM detection):

python scripts/run_training_pfm_detection_pipeline.py --config configs/pipelines_config.yml
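
If you want to run the four commands above back to back, a small wrapper along these lines may help. It is only a sketch: the stage order (TPL/GENKEY, windows, features, training) is an assumption based on the order of the commands above, and run_stages is a hypothetical helper, not part of the repository. By default it only prints the commands (dry run).

```python
import subprocess
import sys

# Script paths are taken from the commands above; the ordering is an assumption.
STAGE_SCRIPTS = [
    "scripts/run_tpl_genkey_pipeline.py",
    "scripts/run_windows_pipeline.py",
    "scripts/run_features_pipeline.py",
    "scripts/run_training_pfm_detection_pipeline.py",
]


def build_commands(config_path: str) -> list[list[str]]:
    """Build one command per stage, using the current Python interpreter."""
    return [[sys.executable, script, "--config", config_path] for script in STAGE_SCRIPTS]


def run_stages(config_path: str, dry_run: bool = True) -> list[list[str]]:
    """Print each stage command and, unless dry_run is set, execute it."""
    commands = build_commands(config_path)
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later stages do not run after a failure.
            subprocess.run(command, check=True)
    return commands


# Dry run: prints the four commands without executing them.
commands = run_stages("configs/pipelines_config.yml")
```

Call run_stages("configs/pipelines_config.yml", dry_run=False) from the project root to actually execute the stages.
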

Each script will:

  1. Load its section from pipelines_config.yml (e.g. tpl_genkey_pipeline, features_pipeline).
  2. Resolve paths (local or S3).
  3. Check idempotency: exit early if the pipeline has already run with the same config.
  4. Execute the pipeline and write outputs.
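
The idempotency check in step 3 can be pictured with the sketch below. It is one possible approach, not necessarily the repository's mechanism: it assumes a hash of the config section is stored in a marker file (the hypothetical _SUCCESS.json) inside the output folder, and that a run is skipped when the stored hash matches the current one.

```python
import hashlib
import json
import tempfile
from pathlib import Path

MARKER_NAME = "_SUCCESS.json"  # hypothetical marker file name


def config_fingerprint(section: dict) -> str:
    """Stable SHA-256 hash of a config section (keys sorted for determinism)."""
    canonical = json.dumps(section, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def already_ran(output_folder: Path, section: dict) -> bool:
    """True if the marker exists and records the same config hash."""
    marker = output_folder / MARKER_NAME
    if not marker.exists():
        return False
    recorded = json.loads(marker.read_text())
    return recorded.get("config_hash") == config_fingerprint(section)


def run_once(output_folder: Path, section: dict, pipeline) -> bool:
    """Run pipeline(section) unless it already ran with this exact config."""
    if already_ran(output_folder, section):
        return False  # exit early: nothing to do
    pipeline(section)
    output_folder.mkdir(parents=True, exist_ok=True)
    marker = output_folder / MARKER_NAME
    marker.write_text(json.dumps({"config_hash": config_fingerprint(section)}))
    return True


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "processed"
    section = {"source_folder": "data/raw/my_run", "output_folder": str(out)}
    calls = []  # stands in for a real pipeline; records each executed config
    first = run_once(out, section, calls.append)
    second = run_once(out, section, calls.append)
    changed = run_once(out, {**section, "source_folder": "data/raw/other"}, calls.append)
    print(first, second, changed, len(calls))
```

Hashing the whole section means any config change, however small, triggers a fresh run.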

4. Run the ETL scheduler (optional)

To run the TPL/GENKEY pipeline on a schedule (e.g. every hour):

  1. Copy or edit configs/etl_scheduler_config.yaml so the pipeline section points to your source_folder and output_folder.
  2. Set scheduler.mode (e.g. interval) and scheduler.interval_seconds (e.g. 3600).
  3. Start the scheduler:
python scripts/etl_scheduler.py --config configs/etl_scheduler_config.yaml

Or use the helper script:

./scripts/start_etl_scheduler.sh
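
For intuition, an interval-mode scheduler boils down to a loop like the one below. This is a simplified sketch, not the code in scripts/etl_scheduler.py: run_on_interval, max_runs, and the injectable sleep function are assumptions added so the loop can be tried without waiting an hour.

```python
import time
from typing import Callable, Optional


def run_on_interval(
    job: Callable[[], None],
    interval_seconds: float,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run job repeatedly, waiting so that runs start interval_seconds apart."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # no need to wait after the final run
        # Subtract the job's own duration; never sleep a negative amount.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs


# Demo: record the requested waits instead of actually sleeping.
naps = []
count = run_on_interval(
    lambda: print("pipeline run"),
    interval_seconds=3600,
    max_runs=3,
    sleep=naps.append,
)
print(count, len(naps))
```

With max_runs left as None the loop runs until the process is stopped, which is closer to how a long-lived scheduler behaves.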

5. Prefect (optional, for production)

For production-style scheduling with Prefect:

  1. Make sure the Prefect dependencies are installed (they are already included in .[etl]).
  2. Use Docker Compose to run Prefect Server + worker (see docker-compose.prefect.yml and Prefect & Production).
  3. Deploy a flow with a schedule:
./prefect_manage.sh deploy daily_at_4pm

See Prefect & Production for full details.


Next steps