ETL Scheduler (APScheduler)

Role

The ETL scheduler runs a pipeline (typically TPL/GENKEY) on a schedule: by interval (e.g. every N seconds), by cron expression, or daily at fixed times. It is implemented with APScheduler (async) and is suitable for single-machine, in-process scheduling.
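
As an illustration of the "daily at fixed times" case, the sketch below computes the next run from a list of times using only the standard library. It is not the scheduler's own code: the function name next_daily_run and the "HH:MM" string format are assumptions, and the calculation is done in whatever timezone the `now` value carries.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now: datetime, daily_times: list[str]) -> datetime:
    """Return the first scheduled time strictly after `now`.

    `daily_times` holds "HH:MM" strings (an assumed format).
    """
    if not daily_times:
        raise ValueError("daily_times must not be empty")
    slots = sorted(time.fromisoformat(t) for t in daily_times)
    # Check the remaining slots today, then fall through to tomorrow.
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).date()
        for slot in slots:
            candidate = datetime.combine(day, slot, tzinfo=now.tzinfo)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: tomorrow's first slot is always later")


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(next_daily_run(now, ["06:00", "18:30"]))
```

A single daily time is the same computation with a one-element list.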


Features

  • Idempotency: The underlying pipeline skips already-processed files; re-runs are safe.
  • Single job at a time: Only one run executes at once, which prevents overlapping runs and resource contention.
  • Modes: interval, cron, daily, multiple_daily (see Scheduler configs).
  • Logging: Rotating file and console; configurable level and format.
  • Graceful shutdown: Handles SIGINT/SIGTERM and stops the scheduler cleanly.
  • Stats: Success/failure counts and optional alerting on consecutive failures.
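
The "single job at a time" and "stats" behaviours listed above can be sketched with plain asyncio. This is a hedged illustration, not the scheduler's implementation: SingleRunGuard, RunStats, and the alert_after threshold are hypothetical names, and the alert is only recorded in a list. APScheduler also provides a per-job max_instances option for limiting concurrent runs; whether this scheduler relies on it is not stated here.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class RunStats:
    successes: int = 0
    failures: int = 0
    consecutive_failures: int = 0
    skipped: int = 0


class SingleRunGuard:
    """Run at most one job at a time and track success/failure counts."""

    def __init__(self, alert_after: int = 3):
        self._lock = asyncio.Lock()
        self.stats = RunStats()
        self.alert_after = alert_after  # hypothetical alert threshold
        self.alerts: list[str] = []

    async def run(self, job) -> bool:
        """Await `job()` unless a run is in progress; return True if it ran."""
        if self._lock.locked():
            # A previous run is still active: skip instead of overlapping.
            self.stats.skipped += 1
            return False
        async with self._lock:
            try:
                await job()
            except Exception as exc:
                self.stats.failures += 1
                self.stats.consecutive_failures += 1
                if self.stats.consecutive_failures >= self.alert_after:
                    n = self.stats.consecutive_failures
                    self.alerts.append(f"{n} consecutive failures: {exc!r}")
            else:
                self.stats.successes += 1
                self.stats.consecutive_failures = 0
        return True
```

Skipping an overlapping run (rather than queueing it) is safe here because the pipeline is idempotent: the next scheduled run picks up any files the skipped run would have processed.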

Config

The scheduler reads a YAML with two main sections:

  • pipeline: Same keys as the pipeline section in pipelines_config.yml (e.g. source_folder, output_folder, selected_columns, max_workers).
  • scheduler: mode, interval_seconds (for interval), cron_expression (for cron), daily_time or daily_times, timezone, and logging (file, level, format, rotation).
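
To make the two-section layout concrete, here is a minimal validation sketch over the parsed config as a Python dict. The key names come from the list above; the mode-to-key pairing (daily with daily_time, multiple_daily with daily_times), the function name validate_scheduler_config, and all example values are assumptions for illustration only.

```python
# Which scheduler key each mode needs (assumed pairing, see lead-in).
REQUIRED_KEY_BY_MODE = {
    "interval": "interval_seconds",
    "cron": "cron_expression",
    "daily": "daily_time",
    "multiple_daily": "daily_times",
}


def validate_scheduler_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means it looks valid."""
    errors = []
    for section in ("pipeline", "scheduler"):
        if not isinstance(config.get(section), dict):
            errors.append(f"missing section: {section}")
    if errors:
        return errors

    scheduler = config["scheduler"]
    mode = scheduler.get("mode")
    if mode not in REQUIRED_KEY_BY_MODE:
        return [f"unknown mode: {mode!r}"]

    required = REQUIRED_KEY_BY_MODE[mode]
    if required not in scheduler:
        errors.append(f"mode {mode!r} requires {required}")
    return errors


# Example values are placeholders, not defaults from the project.
example = {
    "pipeline": {"source_folder": "data/in", "output_folder": "data/out"},
    "scheduler": {"mode": "interval", "interval_seconds": 300, "timezone": "UTC"},
}

if __name__ == "__main__":
    print(validate_scheduler_config(example))
```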

Running

python scripts/etl_scheduler.py --config configs/etl_scheduler_config.yaml
  • One-shot (no schedule): add --run-once
  • Status: add --status to print the next run time and basic stats
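
For orientation, the command-line flags above could be parsed roughly as follows. This is a sketch with argparse, not the actual contents of etl_scheduler.py: treating --config as required and making --run-once and --status mutually exclusive are both assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="etl_scheduler.py")
    # Assumed to be required; the real script may define a default path.
    parser.add_argument("--config", required=True, help="path to the scheduler YAML")
    # Assumed to be mutually exclusive: either run once or report status.
    action = parser.add_mutually_exclusive_group()
    action.add_argument("--run-once", action="store_true",
                        help="run the pipeline once and exit (no schedule)")
    action.add_argument("--status", action="store_true",
                        help="print next run and basic stats")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```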

For production deployments that need observability and retries, use Prefect instead. For the full list of config options, see Scheduler configs.