ETL Scheduler (APScheduler)

Role

The ETL scheduler runs a pipeline (typically TPL/GENKEY) on a schedule: by interval (e.g. every N seconds), by cron expression, or daily at fixed times. It is implemented with APScheduler (async) and is suitable for single-machine, in-process scheduling.
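As a rough illustration of in-process interval scheduling, the sketch below wires a coroutine into APScheduler's AsyncIOScheduler. It assumes APScheduler 3.x is installed. The names run_pipeline, job_options, and main are hypothetical and are not taken from scripts/etl_scheduler.py; the real script may be structured differently.

```python
import asyncio
import signal


async def run_pipeline() -> None:
    """Hypothetical stand-in for the real pipeline entry point."""
    print("pipeline run started")
    await asyncio.sleep(0)  # the real work would happen here


def job_options(interval_seconds: int) -> dict:
    """Keyword arguments for scheduler.add_job() in interval mode."""
    return {
        "trigger": "interval",
        "seconds": interval_seconds,
        "max_instances": 1,  # never run two pipeline jobs concurrently
        "coalesce": True,    # collapse a backlog of missed runs into one
    }


async def main(interval_seconds: int = 60) -> None:
    # Third-party import (APScheduler 3.x assumed), kept local to this function.
    from apscheduler.schedulers.asyncio import AsyncIOScheduler

    scheduler = AsyncIOScheduler()
    scheduler.add_job(run_pipeline, **job_options(interval_seconds))
    scheduler.start()

    # Wait for SIGINT/SIGTERM, then stop the scheduler.
    # loop.add_signal_handler is only available on Unix event loops.
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)
    await stop.wait()
    scheduler.shutdown(wait=False)


# To try it: asyncio.run(main(interval_seconds=10))
```

Because the scheduler lives inside the process's own event loop, nothing survives a restart unless a persistent job store is configured, which is one reason this approach fits single-machine use.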

Features

Idempotency: The underlying pipeline skips already-processed files; re-runs are safe.
Single job at a time: Prevents overlapping runs and resource contention.
Modes: interval, cron, daily, multiple_daily (see Scheduler configs).
Logging: Rotating file and console; configurable level and format.
Graceful shutdown: Handles SIGINT/SIGTERM and stops the scheduler cleanly.
Stats: Success/failure counts and optional alerting on consecutive failures.
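The overlap guard and the failure statistics listed above could be combined roughly as follows. This is a minimal standard-library sketch, not the scheduler's actual code: GuardedRunner, RunStats, and the alert_after threshold are hypothetical names, and the "alert" here is only a printed message.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class RunStats:
    successes: int = 0
    failures: int = 0
    consecutive_failures: int = 0
    alerts: int = 0


class GuardedRunner:
    """Runs an async job, skipping a run if one is already in progress."""

    def __init__(self, job: Callable[[], Awaitable[None]], alert_after: int = 3) -> None:
        self._job = job
        self._alert_after = alert_after  # hypothetical alerting threshold
        self._running = False
        self.stats = RunStats()

    async def run(self) -> bool:
        """Return True if the job ran, False if it was skipped as an overlap."""
        if self._running:
            return False
        self._running = True
        try:
            await self._job()
        except Exception as exc:
            self.stats.failures += 1
            self.stats.consecutive_failures += 1
            if self.stats.consecutive_failures >= self._alert_after:
                self.stats.alerts += 1
                print(f"ALERT: {self.stats.consecutive_failures} "
                      f"consecutive failures (last: {exc!r})")
        else:
            self.stats.successes += 1
            self.stats.consecutive_failures = 0  # a success resets the streak
        finally:
            self._running = False
        return True
```

A plain boolean flag is enough here because all runs share one event loop; a multi-process deployment would need an external lock instead.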

Config

The scheduler reads a YAML with two main sections:

pipeline: Same keys as the pipeline section in pipelines_config.yml (e.g. source_folder, output_folder, selected_columns, max_workers).
scheduler: mode, interval_seconds (for interval), cron_expression (for cron), daily_time or daily_times, timezone, and logging (file, level, format, rotation).
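To make the mapping from the scheduler section to triggers concrete, here is a hedged sketch that turns an already-parsed config dictionary into keyword arguments suitable for APScheduler's add_job(func, **kwargs). The function names are hypothetical, and two formats are assumed rather than confirmed by this page: daily_time and daily_times entries are "HH:MM" strings, and cron_expression is a standard five-field expression.

```python
from typing import Any, Dict, List

CRON_FIELDS = ("minute", "hour", "day", "month", "day_of_week")


def _with_timezone(kwargs: Dict[str, Any], cfg: Dict[str, Any]) -> Dict[str, Any]:
    # Only pass a timezone through when the config sets one.
    if cfg.get("timezone"):
        kwargs["timezone"] = cfg["timezone"]
    return kwargs


def _daily(time_str: str, cfg: Dict[str, Any]) -> Dict[str, Any]:
    # Assumed format: "HH:MM" in 24-hour time.
    hour, minute = (int(part) for part in time_str.split(":"))
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid time of day: {time_str!r}")
    return _with_timezone({"trigger": "cron", "hour": hour, "minute": minute}, cfg)


def triggers_from_config(cfg: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one add_job() kwargs dict per job implied by the config."""
    mode = cfg["mode"]
    if mode == "interval":
        return [{"trigger": "interval", "seconds": int(cfg["interval_seconds"])}]
    if mode == "cron":
        fields = cfg["cron_expression"].split()
        if len(fields) != len(CRON_FIELDS):
            raise ValueError("cron_expression must have five fields")
        return [_with_timezone({"trigger": "cron", **dict(zip(CRON_FIELDS, fields))}, cfg)]
    if mode == "daily":
        return [_daily(cfg["daily_time"], cfg)]
    if mode == "multiple_daily":
        return [_daily(t, cfg) for t in cfg["daily_times"]]
    raise ValueError(f"unknown scheduler mode: {mode!r}")
```

For example, a config with mode multiple_daily and daily_times of "06:00" and "18:30" would yield two cron-style jobs, one per time of day.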

Running

python scripts/etl_scheduler.py --config configs/etl_scheduler_config.yaml

One-shot (no schedule): --run-once
Status: --status prints the next scheduled run time and basic stats
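The three flags above could be parsed along these lines. This is only a sketch using argparse; whether --config is actually required, and the help texts, are assumptions rather than details taken from the real script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Run the ETL pipeline on a schedule."
    )
    # Assumption: the config path is mandatory.
    parser.add_argument("--config", required=True,
                        help="Path to the scheduler YAML config.")
    parser.add_argument("--run-once", action="store_true",
                        help="Run the pipeline once and exit (no schedule).")
    parser.add_argument("--status", action="store_true",
                        help="Print the next run time and basic stats.")
    return parser


# Example: build_parser().parse_args(
#     ["--config", "configs/etl_scheduler_config.yaml", "--run-once"])
```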

For production deployments that need observability and retries, use Prefect instead. For the full set of config options, see Scheduler configs.