Feature Extraction (Deep Dive)

Role

The features pipeline reads window Parquet files and produces one feature vector per window. Each vector holds wavelet coefficients (e.g. computed with PyWavelets, using a configurable wavelet and decomposition level), optional raw statistics, and metadata such as source file, case_id, and label. The output is a single aggregated Parquet file, plus a schema and optional checkpoints.
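To make the "one feature vector per window" idea concrete, here is a minimal sketch using only the standard library. It is not the pipeline's implementation: it swaps in a hand-rolled Haar transform as a stand-in for a PyWavelets decomposition, and it summarises each band by its energy so that every window yields the same number of features. The function names, the band-energy summary, the column name `pressure`, and the file name `example.parquet` are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def haar_dwt(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    signal = list(signal)
    if len(signal) % 2:
        signal.append(signal[-1])  # pad odd-length input by repeating the last sample
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def wavelet_features(signal, level):
    """Decompose `level` times and summarise each band by its energy (an assumption)."""
    feats = {}
    approx = list(signal)
    for lvl in range(1, level + 1):
        approx, detail = haar_dwt(approx)
        feats[f"d{lvl}_energy"] = sum(c * c for c in detail)
    feats[f"a{level}_energy"] = sum(c * c for c in approx)
    return feats

def window_to_row(window, column, level, meta, raw_stats=True):
    """Build one output row: wavelet features, optional raw stats, then metadata."""
    signal = window[column]
    row = {f"{column}_{k}": v for k, v in wavelet_features(signal, level).items()}
    if raw_stats:
        row[f"{column}_mean"] = mean(signal)
        row[f"{column}_std"] = pstdev(signal)
    row.update(meta)
    return row

# Hypothetical window with a single signal column.
window = {"pressure": [4.0, 4.1, 4.0, 3.9, 3.0, 2.5, 2.4, 2.4]}
meta = {"source_file": "example.parquet", "case_id": 1, "label": 0}
row = window_to_row(window, "pressure", level=2, meta=meta)
```

Because the Haar transform used here is orthonormal, the band energies add up to the energy of the original window when its length is a multiple of 2 ** level, which is a convenient sanity check when changing the level.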


Why It Matters for Engineering

  • Leak vs operational: The choice of wavelet, decomposition level, and input columns (see Leak detection features) directly affects how well the model can separate leaks from operational transients.
  • Temporal context: Optional “previous window” and delta features (see Temporal context) are implemented in or alongside this pipeline.
  • Performance: Parallel file processing (ProcessPoolExecutor) and buffer-and-flush writing keep memory and I/O under control for large runs.
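The performance point above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: `extract_rows`, `BufferedWriter`, `run`, and the `sink` callable are hypothetical names, and the worker returns placeholder rows instead of reading Parquet. Only `ProcessPoolExecutor` and the config key names `write_buffer_rows`, `use_parallel`, and `max_workers` come from this page.

```python
from concurrent.futures import ProcessPoolExecutor

def extract_rows(path):
    """Hypothetical per-file worker: one feature row per window in the file."""
    # Stand-in for reading a window Parquet file and computing features.
    return [{"source_file": path, "window": i} for i in range(3)]

class BufferedWriter:
    """Collects rows and hands them to `sink` in batches of `write_buffer_rows`."""

    def __init__(self, write_buffer_rows, sink):
        self.write_buffer_rows = write_buffer_rows
        self.sink = sink  # callable that persists a list of rows
        self.buffer = []

    def add(self, rows):
        self.buffer.extend(rows)
        # Flush full batches so the buffer never grows far beyond the limit.
        while len(self.buffer) >= self.write_buffer_rows:
            batch = self.buffer[:self.write_buffer_rows]
            self.buffer = self.buffer[self.write_buffer_rows:]
            self.sink(batch)

    def close(self):
        # Flush whatever is left at the end of the run.
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []

def run(paths, write_buffer_rows, sink, use_parallel=False, max_workers=None):
    writer = BufferedWriter(write_buffer_rows, sink)
    if use_parallel:
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            # map() yields results in input order as workers finish them.
            for rows in pool.map(extract_rows, paths):
                writer.add(rows)
    else:
        for path in paths:
            writer.add(extract_rows(path))
    writer.close()

# Serial run for illustration; each flushed batch is appended to a list.
batches = []
run(["a.parquet", "b.parquet", "c.parquet"], write_buffer_rows=4, sink=batches.append)
```

Only the main process writes, so workers never contend for the output file, and peak memory is bounded by roughly one buffer plus the in-flight results. When running with `use_parallel=True` as a script, the call to `run` should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.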

Configuration and Pipelines

  • Config: the features_pipeline section of pipelines_config.yml, with keys source_folder, output_folder, feature_columns, label_column, wavelet, wavelet_level, write_buffer_rows, checkpoint_frequency, use_parallel, and max_workers.
  • Detailed pipeline description: See Features pipeline in the main Pipelines section.
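As a rough picture of how the keys listed above fit together, the sketch below models the features_pipeline section as a typed object with basic validation. The key names come from this page; the class name `FeaturesPipelineConfig`, the types, every default value, and the validation rules are assumptions for illustration and are not taken from the Pipelines docs.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class FeaturesPipelineConfig:
    # Key names mirror the features_pipeline section; defaults are hypothetical.
    source_folder: str
    output_folder: str
    feature_columns: List[str]
    label_column: str
    wavelet: str = "db4"
    wavelet_level: int = 3
    write_buffer_rows: int = 10000
    checkpoint_frequency: int = 50
    use_parallel: bool = True
    max_workers: Optional[int] = None

    @classmethod
    def from_dict(cls, raw):
        """Build a config from a parsed YAML mapping, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"Unknown features_pipeline keys: {sorted(unknown)}")
        cfg = cls(**raw)
        if cfg.wavelet_level < 1:
            raise ValueError("wavelet_level must be >= 1")
        if cfg.write_buffer_rows < 1:
            raise ValueError("write_buffer_rows must be >= 1")
        return cfg

# Example mapping, as it might look after parsing the YAML section (values are made up).
raw = {
    "source_folder": "data/windows",
    "output_folder": "data/features",
    "feature_columns": ["pressure", "flow"],
    "label_column": "label",
    "wavelet": "sym5",
    "wavelet_level": 4,
}
cfg = FeaturesPipelineConfig.from_dict(raw)
```

Rejecting unknown keys up front catches typos such as `wavelet_levels` before a long run starts, which matters more as runs get larger.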

This Engineering page gives the rationale and context; the full configuration reference is in the Pipelines docs.