Feature Extraction (Deep Dive)
Role
The features pipeline reads window Parquet files and produces one feature vector per window. Each vector holds wavelet coefficients (computed with a library such as PyWavelets, using a configurable wavelet and decomposition level), optional raw statistics, and metadata (e.g. source file, case_id, label). The output is a single aggregated Parquet file, plus a schema and optional checkpoints.
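To make the per-window step concrete, here is a minimal sketch of turning one window into one feature row. It is not the pipeline's actual code: it uses a hand-written Haar transform as a dependency-free stand-in for the configurable wavelet, and the function names, the `pressure` column, and the `<column>_w<i>` naming scheme are all assumptions made for illustration. In a real run, a library call such as `pywt.wavedec` would likely take the place of `haar_decompose`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One Haar DWT level: returns (approximation, detail).

    Simplification: an odd trailing sample is dropped rather than padded.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def haar_decompose(signal: Sequence[float], level: int) -> List[List[float]]:
    """Multi-level decomposition ordered [cA_n, cD_n, ..., cD_1]."""
    bands: List[List[float]] = []
    approx = list(signal)
    for _ in range(level):
        if len(approx) < 2:
            break  # nothing left to split
        approx, detail = haar_step(approx)
        bands.insert(0, detail)
    bands.insert(0, approx)
    return bands


def window_features(
    window: Dict[str, Sequence[float]],
    feature_columns: Sequence[str],
    level: int,
    metadata: Dict[str, object],
) -> Dict[str, object]:
    """Build one feature row: metadata plus flattened wavelet coefficients."""
    row: Dict[str, object] = dict(metadata)
    for col in feature_columns:
        flat = [c for band in haar_decompose(window[col], level) for c in band]
        for i, coeff in enumerate(flat):
            row[f"{col}_w{i}"] = coeff  # hypothetical naming scheme
    return row


# Hypothetical window with a single signal column.
example_row = window_features(
    {"pressure": [1.0] * 8},
    feature_columns=["pressure"],
    level=2,
    metadata={"source_file": "window_0001.parquet", "case_id": "c1", "label": 0},
)
```

For a constant signal all detail coefficients are zero, so only the approximation band carries energy. Whether the real pipeline flattens coefficients or summarises each band is not stated on this page.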
Why It Matters for Engineering
- Leak vs operational: The choice of wavelet, level, and which columns to use (see Leak detection features) directly affects how well the model can separate leak from operational transients.
- Temporal context: Optional “previous window” and delta features (see Temporal context) are implemented in or alongside this pipeline.
- Performance: Parallel file processing (ProcessPoolExecutor) and buffer-and-flush writing keep memory and I/O under control for large runs.
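The performance point above can be sketched as follows. This is an illustration under assumptions, not the pipeline's implementation: `BufferedWriter`, `run_features`, and the `sink` callable are hypothetical names, and `sink` stands in for whatever appends a chunk of rows to the output Parquet file. Worker processes only extract rows; the parent process alone buffers and writes, so at most `buffer_rows` rows plus one file's worth of pending rows are buffered by the writer at any time.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


class BufferedWriter:
    """Collects rows and hands them to `sink` in chunks of `buffer_rows`."""

    def __init__(self, sink: Callable[[List[Row]], None], buffer_rows: int):
        self.sink = sink
        self.buffer_rows = buffer_rows
        self._buffer: List[Row] = []

    def add(self, rows: Iterable[Row]) -> None:
        self._buffer.extend(rows)
        # Flush full chunks as soon as the threshold is reached.
        while len(self._buffer) >= self.buffer_rows:
            chunk = self._buffer[: self.buffer_rows]
            self._buffer = self._buffer[self.buffer_rows :]
            self.sink(chunk)

    def close(self) -> None:
        # Flush whatever is left at the end of the run.
        if self._buffer:
            self.sink(self._buffer)
            self._buffer = []


def run_features(
    paths: Iterable[str],
    extract: Callable[[str], List[Row]],
    sink: Callable[[List[Row]], None],
    buffer_rows: int,
    max_workers: Optional[int] = None,
    executor_cls=ProcessPoolExecutor,
) -> None:
    """Extract rows from each file in parallel and write them in chunks.

    With ProcessPoolExecutor, `extract` must be a picklable top-level
    function; ThreadPoolExecutor can be passed for quick local checks.
    """
    writer = BufferedWriter(sink, buffer_rows)
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(extract, path) for path in paths]
        for future in as_completed(futures):
            writer.add(future.result())
    writer.close()
```

When process-based execution is used from a script, the call to `run_features` typically needs to sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.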
Configuration and Pipelines
- Config: features_pipeline in pipelines_config.yml: source_folder, output_folder, feature_columns, label_column, wavelet, wavelet_level, write_buffer_rows, checkpoint_frequency, use_parallel, max_workers.
- Detailed pipeline description: See Features pipeline in the main Pipelines section.
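As a rough illustration of how the keys listed above might be represented once the `features_pipeline` section has been loaded, the sketch below mirrors them in a dataclass. The key names come from this page; the class name, the field types, the assumption that every key is required, the validation rules, and all example values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Mapping, Optional


@dataclass
class FeaturesPipelineConfig:
    # Field names follow the config keys; the types are assumptions.
    source_folder: str
    output_folder: str
    feature_columns: List[str]
    label_column: str
    wavelet: str
    wavelet_level: int
    write_buffer_rows: int
    checkpoint_frequency: int
    use_parallel: bool
    max_workers: Optional[int]

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "FeaturesPipelineConfig":
        """Build a config from an already-parsed mapping, rejecting typos."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        missing = known - set(raw)
        if unknown or missing:
            raise ValueError(
                f"unknown keys: {sorted(unknown)}; missing keys: {sorted(missing)}"
            )
        cfg = cls(**raw)
        if cfg.wavelet_level < 1:
            raise ValueError("wavelet_level must be at least 1")
        if cfg.write_buffer_rows < 1:
            raise ValueError("write_buffer_rows must be at least 1")
        return cfg


# Hypothetical values, for illustration only.
example_config = FeaturesPipelineConfig.from_mapping(
    {
        "source_folder": "data/windows",
        "output_folder": "data/features",
        "feature_columns": ["pressure"],
        "label_column": "label",
        "wavelet": "db4",
        "wavelet_level": 3,
        "write_buffer_rows": 10000,
        "checkpoint_frequency": 50,
        "use_parallel": True,
        "max_workers": 4,
    }
)
```

Rejecting unknown keys up front is a design choice of this sketch: a misspelled key such as `wavelet_levle` fails loudly instead of silently falling back to a default.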
This Engineering page gives the rationale and context; the full configuration reference is in the Pipelines docs.