World-Class ETL Architecture
Executive Summary
The ETL layer is built so that:
- A generic ETL pipeline (extract → transform → load) is reused across pipeline types.
- Concrete behavior is implemented by specialized extractors, transformers, loaders, and validators (e.g. for TPL/GENKEY, Windows, Features, Parquet–CSV).
- The design follows SOLID principles and common design patterns to keep the system testable, maintainable, and performant.
Design Principles
- Single Responsibility — Each class has one clear role (e.g. “extract TPL/GENKEY pairs”, “transform raw to windows”, “write Parquet”).
- Open/Closed — New pipeline types are added by new implementations of the same interfaces, not by changing the core ETL loop.
- Liskov Substitution — Any extractor/transformer/loader that implements the interface can be swapped in without breaking the pipeline.
- Interface Segregation — Interfaces are small and focused (e.g. ETLExtractor, ETLLoader).
- Dependency Inversion — The pipeline depends on abstractions (interfaces), not on concrete classes; factories or config provide the concrete instances.
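As an illustration of these principles, here is a minimal sketch of small, focused interfaces and of code that depends only on them. The interface names `ETLExtractor`, `ETLTransformer`, `ETLLoader`, and `ETLBatch` come from this document; the method names, the `ETLBatch` fields, and the `InMemoryExtractor`, `ListLoader`, and `copy_all` helpers are assumptions made for the example and may not match the real signatures in mlops/etl/.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, Protocol, runtime_checkable


@dataclass
class ETLBatch:
    """One unit of work passed between stages (fields are illustrative)."""
    batch_id: str
    data: Any
    metadata: dict = field(default_factory=dict)


@runtime_checkable
class ETLExtractor(Protocol):
    def extract(self) -> Iterator[ETLBatch]: ...


@runtime_checkable
class ETLTransformer(Protocol):
    def transform(self, batch: ETLBatch) -> ETLBatch: ...


@runtime_checkable
class ETLLoader(Protocol):
    def load(self, batch: ETLBatch) -> None: ...


class InMemoryExtractor:
    """Hypothetical extractor: one batch per input row."""

    def __init__(self, rows):
        self.rows = list(rows)

    def extract(self) -> Iterator[ETLBatch]:
        for i, row in enumerate(self.rows):
            yield ETLBatch(batch_id=f"batch-{i}", data=row)


class ListLoader:
    """Hypothetical loader: collects batch payloads in a list."""

    def __init__(self):
        self.written = []

    def load(self, batch: ETLBatch) -> None:
        self.written.append(batch.data)


def copy_all(extractor: ETLExtractor, loader: ETLLoader) -> int:
    """Depends only on the abstractions, so any conforming pair can be passed in."""
    count = 0
    for batch in extractor.extract():
        loader.load(batch)
        count += 1
    return count
```

Because each interface exposes a single method, a class only has to implement what its stage needs, and `copy_all` never names a concrete class. Swapping `ListLoader` for a Parquet-writing loader would not change the calling code.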
Patterns Used
- Strategy — Different pipeline “types” (TPL/GENKEY, Windows, Features) are different strategies for extract/transform/load.
- Factory — Each pipeline type has a factory (e.g. TPLGenkeyPipelineFactory, FeaturesPipelineFactory) that builds the right extractor, transformer, loader, and validator from config.
- Template method — The base ETLPipeline.execute() defines the flow (extract batches → validate → transform → load); concrete behavior is in the injected components.
- Composite — Validators can be composed (e.g. “homogeneity + schema”) so the pipeline still sees a single validator.
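The Composite and Factory patterns can be sketched together: a factory function reads config and returns one validator that internally runs several checks. Everything below is hypothetical, including the class names, the `validate` signature, the `ValidationResult` type, and the config keys `min_rows` and `required_columns`. The row-count and schema checks stand in for whatever checks the real validators perform.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ValidationResult:
    ok: bool
    errors: List[str]


class RowCountValidator:
    """Hypothetical check: a batch must contain at least `min_rows` rows."""

    def __init__(self, min_rows: int) -> None:
        self.min_rows = min_rows

    def validate(self, rows: Sequence[dict]) -> ValidationResult:
        if len(rows) < self.min_rows:
            msg = f"expected at least {self.min_rows} rows, got {len(rows)}"
            return ValidationResult(False, [msg])
        return ValidationResult(True, [])


class SchemaValidator:
    """Hypothetical check: every row must contain the required columns."""

    def __init__(self, required_columns: Sequence[str]) -> None:
        self.required = set(required_columns)

    def validate(self, rows: Sequence[dict]) -> ValidationResult:
        errors = []
        for i, row in enumerate(rows):
            missing = self.required - set(row)
            if missing:
                errors.append(f"row {i} is missing columns {sorted(missing)}")
        return ValidationResult(not errors, errors)


class CompositeValidator:
    """Runs every child and merges the results; callers see one validator."""

    def __init__(self, children) -> None:
        self.children = list(children)

    def validate(self, rows: Sequence[dict]) -> ValidationResult:
        errors: List[str] = []
        for child in self.children:
            errors.extend(child.validate(rows).errors)
        return ValidationResult(not errors, errors)


def build_validator(config: dict) -> CompositeValidator:
    """Factory step: turn a (hypothetical) config dict into a single validator."""
    children = []
    if "min_rows" in config:
        children.append(RowCountValidator(config["min_rows"]))
    if "required_columns" in config:
        children.append(SchemaValidator(config["required_columns"]))
    return CompositeValidator(children)
```

The pipeline would call `validate` once per batch and would not need to know how many checks sit behind it, so adding a check only changes the factory.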
File Layout (Conceptual)
mlops/etl/
├── interfaces.py # ETLExtractor, ETLTransformer, ETLLoader, ETLValidator, ETLBatch, ETLResult
├── etl_pipeline.py # ETLPipeline (orchestrator), ETLConfig
├── validators.py # Composite and base validators
├── tpl_genkey_* # TPL/GENKEY extractors, transformers, loaders, validators, factory
├── windows_* # Windows pipeline type
├── features_* # Features pipeline (extractors, transformers, loaders, parallel processor)
└── ... # Other pipeline types
Each pipeline type lives in its own module (or subpackage) and implements the same interfaces so the runner can execute them uniformly.
Data Flow
- Extractor — Lists input items (e.g. files), optionally skips “already processed”, and yields batches (e.g. one batch = one file or a group of files).
- Validator — Can validate each batch (e.g. row count, schema) before transform.
- Transformer — Turns a raw batch into the shape needed for load (e.g. raw → windows, or windows → feature vector).
- Loader — Writes the transformed batch to disk (or S3 when supported): Parquet, CSV, or metadata. May use buffers and checkpoints (e.g. features pipeline).
Progress monitoring (and Prefect logging) is pluggable via config, so the same pipeline can run from the CLI or under Prefect without code changes.
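The flow above can be sketched as a single loop with a pluggable progress callback. This is a simplified illustration, not the actual ETLPipeline.execute(): the `FileListExtractor` class, the `run_pipeline` function, the dict-shaped batches, and the choice to skip and count rejected batches are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterator, List, Optional, Set


class FileListExtractor:
    """Hypothetical extractor: one batch per input, skipping processed ones."""

    def __init__(self, inputs: Dict[str, List[int]], already_processed: Set[str]):
        self.inputs = inputs  # stand-in for files on disk: name -> rows
        self.already_processed = already_processed

    def extract(self) -> Iterator[dict]:
        for name, rows in self.inputs.items():
            if name in self.already_processed:
                continue  # skipping finished inputs keeps reruns idempotent
            yield {"source": name, "rows": rows}


def run_pipeline(
    extractor: FileListExtractor,
    validate: Callable[[dict], bool],
    transform: Callable[[dict], dict],
    load: Callable[[dict], None],
    on_progress: Optional[Callable[[str, int, int], None]] = None,
) -> Dict[str, int]:
    """Extract batches → validate → transform → load, reporting after each batch."""
    loaded = 0
    rejected = 0
    for batch in extractor.extract():
        if validate(batch):
            load(transform(batch))
            loaded += 1
        else:
            rejected += 1  # assumption: invalid batches are skipped, not fatal
        if on_progress is not None:
            on_progress(batch["source"], loaded, rejected)
    return {"loaded": loaded, "rejected": rejected}


# Usage: the same loop with a print-based progress reporter.
sink: List[dict] = []
summary = run_pipeline(
    FileListExtractor({"a.csv": [1, 2], "b.csv": [], "c.csv": [3]}, {"c.csv"}),
    validate=lambda b: len(b["rows"]) > 0,
    transform=lambda b: {**b, "rows": [r * 10 for r in b["rows"]]},
    load=sink.append,
    on_progress=lambda src, ok, bad: print(f"{src}: loaded={ok} rejected={bad}"),
)
```

Passing a different `on_progress` callable (for example one that writes to a workflow logger) changes how progress is reported without touching the loop, which is the property the paragraph above describes.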
Why This Matters for You
- Adding a new pipeline type: Implement the four interfaces for that type and register a factory; the runner and scheduler stay unchanged.
- Testing: You can unit-test extractors, transformers, and loaders in isolation with mock data.
- Consistency: All pipelines share the same idempotency, config, and path-resolution patterns (e.g. resolve_config_paths), so behavior is predictable across the platform.
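To make the first point concrete, here is one way "register a factory" could look. This is a sketch under stated assumptions: the registry dict, the `register_pipeline` decorator, `build_pipeline`, `run`, and the toy "uppercase" pipeline type are all hypothetical and are not taken from the codebase.

```python
from typing import Callable, Dict

# Maps a pipeline type name to a factory that builds its components from config.
_FACTORIES: Dict[str, Callable[[dict], dict]] = {}


def register_pipeline(name: str):
    """Decorator that registers a factory under `name` (hypothetical helper)."""

    def decorator(factory: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in _FACTORIES:
            raise ValueError(f"pipeline type already registered: {name}")
        _FACTORIES[name] = factory
        return factory

    return decorator


def build_pipeline(name: str, config: dict) -> dict:
    """Look up the factory for `name` and build its components."""
    if name not in _FACTORIES:
        raise KeyError(f"unknown pipeline type: {name}")
    return _FACTORIES[name](config)


@register_pipeline("uppercase")
def uppercase_factory(config: dict) -> dict:
    """A toy pipeline type: the four components as plain callables."""
    return {
        "extract": lambda: iter(config["inputs"]),
        "validate": lambda item: isinstance(item, str),
        "transform": str.upper,
        "load": config["sink"].append,
    }


def run(name: str, config: dict) -> None:
    """Runner that knows nothing about specific pipeline types."""
    parts = build_pipeline(name, config)
    for item in parts["extract"]():
        if parts["validate"](item):
            parts["load"](parts["transform"](item))
```

Adding another pipeline type means writing one more decorated factory; `run` and `build_pipeline` stay unchanged. Each component is also a plain callable, so it can be exercised on its own in a unit test with mock data.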
This document describes the architecture philosophy; the exact class names and method signatures are in the codebase under mlops/etl/.