Feature Selection Pipeline

Selects the top-K features by LightGBM importance (binary classification) and writes a reduced feature schema JSON for use in training and Optuna.

Script

python scripts/run_feature_selection_pipeline.py --config configs/pipelines_config.yml

Config section: feature_selection_pipeline.

Purpose

Input: a Parquet dataset (input_path: a single file or a directory of Parquet files) and an optional features_schema_file (a JSON list of the feature names to consider).
Output: a JSON file listing the selected feature names (output_schema_file, e.g. features_schema_rfe.json or a custom path). A .idempotent_meta.json sidecar stores the config hash used for idempotency checks.
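
To make the file format concrete, here is a minimal sketch of writing and reading such a schema. It assumes the schema is a plain JSON array of strings; the helper names write_schema and read_schema are hypothetical and not part of the pipeline.

```python
import json
from pathlib import Path


def write_schema(path, feature_names):
    """Write feature names as a JSON list (assumed schema layout)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(list(feature_names), indent=2), encoding="utf-8")


def read_schema(path):
    """Read a schema file and check that it is a list of strings."""
    names = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError(f"{path} is not a JSON list of feature names")
    return names
```

A training step could then call read_schema on the reduced file to load exactly the selected columns.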

Main configuration

Key	Description
`input_path`	Parquet file or directory of Parquet files (features dataset).
`features_schema_file`	Optional schema JSON (list of columns); if missing, all numeric columns may be used.
`output_schema_file`	Output path for the reduced schema JSON.
`label_column`	Target column (e.g. `label`).
`top_k_features`	Number of features to keep (by importance).
`test_size`	Fraction for validation split.
`random_state`	Random seed.
`pipeline_type`	`PFM` or `OBSERVER`; selects the base training section from which missing keys are inherited.
`feature_columns`	Optional base columns to restrict selection.

Configuration reference (each item explained)

Item	Meaning	What it solves	Notes
`input_path`	Path to the features Parquet (single file or directory of Parquet files).	Defines the dataset on which importance is computed; usually the same as used for training.	Same dataset as in `features_pipeline` output.
`features_schema_file`	Path to a JSON file listing all feature names (full schema).	Defines the candidate set of features and their order; selection chooses a subset of these.	Typically the full schema from the Features pipeline.
`output_schema_file`	Path where the reduced schema (selected feature names only) will be written.	Provides the schema for training and Optuna so they use the same subset of features.	Training pipelines reference this file in `features_schema_file`.
`pipeline_type`	`"PFM"` or `"OBSERVER"`.	Selects which training section to use as reference (e.g. for default paths or behavior).	Must match the training pipelines you run later.
`label_column`	Target column: `"label"` (detection), `"LEAK_SIZE_in"`, `"LEAK_LOCATION_m"`, or `"LEAK_FLOW_kg_s"`.	Determines the task (detection, size, location, leak flow) and thus the importance ranking.	Must exist in the features dataset.
`feature_columns`	Optional list of base signal names to restrict which derived features are considered.	Limits selection to features derived from these signals; reduces noise and keeps interpretation aligned with domain.	If omitted, all columns in the schema may be used.
`top_k_features`	Maximum number of features to keep (Top-K by importance).	Reduces dimensionality and overfitting; smaller K = simpler models, larger K = more capacity.	`null` = keep all (no reduction).
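
The feature_columns restriction can be pictured as a filter over the schema. The sketch below assumes that a derived feature name starts with the name of its base signal; the real naming convention is not stated here, so treat both the matching rule and the helper name restrict_to_base_signals as hypothetical.

```python
def restrict_to_base_signals(schema, feature_columns):
    """Keep schema entries derived from one of the base signals.

    Assumption: a derived feature name starts with its base signal name.
    """
    if not feature_columns:  # None or empty list -> no restriction
        return list(schema)
    bases = tuple(feature_columns)
    # str.startswith accepts a tuple and matches any of its entries.
    return [name for name in schema if name.startswith(bases)]


# Hypothetical names, for illustration only.
schema = ["PT_inlet__mean", "PT_inlet__std", "GT_outlet__mean"]
print(restrict_to_base_signals(schema, ["PT_inlet"]))
# prints ['PT_inlet__mean', 'PT_inlet__std']
```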

Configuration template

Add this block to configs/pipelines_config.yml under the key feature_selection_pipeline:

feature_selection_pipeline:
  input_path: "data/features/SS/data/features_dataset.parquet"
  features_schema_file: "data/features/SS/metadata/features_schema.json"
  output_schema_file: "data/features/SS/metadata/features_schema_rfe_pfm_detection.json"

  pipeline_type: "PFM"   # or "OBSERVER"
  label_column: "label"  # or "LEAK_SIZE_in", "LEAK_LOCATION_m", "LEAK_FLOW_kg_s"

  feature_columns:
    - "PT 'POSITION:' 'POSITION_1378M' '(PA)' 'Pressure'"
    - "GT 'POSITION:' 'POSITION_1378M' '(KG/S)' 'Total mass flow'"
    # ... base columns to consider

  top_k_features: 32

See the full file pipelines_config.yml in the repository for the complete section.
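
Before running the script, it can help to sanity-check the section once it has been loaded into a dict. The following is a sketch based only on the values documented above; check_section is a hypothetical helper, and the script itself may validate differently or not at all.

```python
ALLOWED_PIPELINE_TYPES = {"PFM", "OBSERVER"}
ALLOWED_LABEL_COLUMNS = {"label", "LEAK_SIZE_in", "LEAK_LOCATION_m", "LEAK_FLOW_kg_s"}


def check_section(section):
    """Return a list of problems found in a feature_selection_pipeline dict."""
    problems = []
    if section.get("pipeline_type", "PFM") not in ALLOWED_PIPELINE_TYPES:
        problems.append("pipeline_type must be 'PFM' or 'OBSERVER'")
    if section.get("label_column", "label") not in ALLOWED_LABEL_COLUMNS:
        problems.append("label_column is not one of the documented targets")
    top_k = section.get("top_k_features", 32)
    # bool is a subclass of int in Python, so reject it explicitly.
    if top_k is not None and (
        isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1
    ):
        problems.append("top_k_features must be a positive integer or null")
    columns = section.get("feature_columns")
    if columns is not None and not all(isinstance(c, str) for c in columns):
        problems.append("feature_columns must be a list of strings or null")
    return problems
```

An empty list means the section passed these checks; it does not guarantee that the paths exist or that label_column is present in the dataset.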

Default values applied by the script

run_feature_selection_pipeline.py uses two layers of defaults:

First, the feature_selection_pipeline section starts from the internal FEATURE_SELECTION_SECTION_DEFAULTS.
Then the script resolves a base training section (training_pfm_* or training_observer_*) from pipeline_type and label_column, and inherits missing training-related keys from it.

Defaults owned by `feature_selection_pipeline`

Key	Default
`pipeline_type`	`"PFM"`
`label_column`	`"label"`
`input_path`	`null` -> usually inherited from the resolved training section
`features_schema_file`	`null` -> usually inherited from the resolved training section
`output_schema_file`	`null` -> auto-derived from `features_schema_file` as `*_rfe.json`, or `features_schema_rfe.json` as fallback
`top_k_features`	`32`
`feature_columns`	`null` -> use all candidate features in the schema
`random_state`	`42`
`test_size`	`0.25`
`shuffle`	`true`
`lgbm_params`	`{}` -> no local override; base LightGBM params are inherited
`num_boost_round`	`1000`
`early_stopping_rounds`	`50`
`min_improvement`	`0.0`
`verbose_eval`	`50`
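
The auto-derivation of output_schema_file can be sketched as follows. This assumes the derived file sits next to the input schema and that only the file stem changes; derive_output_schema_file is a hypothetical name, and the script's actual rule may differ in details.

```python
import posixpath


def derive_output_schema_file(features_schema_file):
    """Derive '<stem>_rfe.json' beside the input schema, or use the fallback."""
    if not features_schema_file:
        return "features_schema_rfe.json"
    # splitext works on plain strings, so s3:// prefixes are left untouched.
    base, _ext = posixpath.splitext(features_schema_file)
    return f"{base}_rfe.json"


print(derive_output_schema_file("data/features/SS/metadata/features_schema.json"))
# prints data/features/SS/metadata/features_schema_rfe.json
```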

Inherited defaults from the resolved training section

Once the script maps the task to its source training section, it inherits defaults such as:

lgbm_params from the corresponding training script,
input_path and features_schema_file if they were omitted locally,
task-specific defaults like deviation_percentage, filter_zero_flow, filter_zero_size, filter_zero_location, or min_label when they exist.

The script also forces robustness.enabled = false during feature selection, even if the base training section enables robustness, so importance is computed on clean data.
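
The layering described in this section can be sketched roughly as below. This is an illustration under stated assumptions, not the script's code: SECTION_DEFAULTS is a shortened stand-in for FEATURE_SELECTION_SECTION_DEFAULTS, resolve_config is a hypothetical name, and the precedence shown (local values over inherited ones, local lgbm_params layered on top of the base params) is inferred from the tables above.

```python
import copy

# Shortened stand-in for the documented section defaults.
SECTION_DEFAULTS = {
    "pipeline_type": "PFM",
    "label_column": "label",
    "input_path": None,
    "features_schema_file": None,
    "top_k_features": 32,
    "lgbm_params": {},
}


def resolve_config(user_section, training_section):
    """Merge section defaults, user values, and the base training section."""
    config = copy.deepcopy(SECTION_DEFAULTS)
    config.update(copy.deepcopy(user_section))
    for key, value in training_section.items():
        if key == "lgbm_params":
            # Base LightGBM params first, local overrides on top.
            config[key] = {**(value or {}), **(config.get(key) or {})}
        elif key != "top_k_features" and config.get(key) is None:
            # Inherit only what is missing locally; an explicit
            # top_k_features: null must survive as "keep all".
            config[key] = copy.deepcopy(value)
    # Robustness is always switched off for feature selection.
    config["robustness"] = {**(config.get("robustness") or {}), "enabled": False}
    return config
```

With a training section that sets input_path and enables robustness, the resolved config picks up the path but ends with robustness.enabled set to false.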

Special case: if you set top_k_features: null explicitly, the script keeps all candidate features instead of applying a Top-K cutoff.
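
The Top-K cutoff, including the null case, can be sketched as below. The importance values are assumed to come from a trained LightGBM model (for example via Booster.feature_importance()); which importance type the script uses, and whether it writes features in schema order or in importance order, are not stated here, so the ordering in this sketch is an assumption. select_top_k is a hypothetical helper.

```python
def select_top_k(feature_names, importances, top_k):
    """Return the top_k most important features, kept in schema order."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    if top_k is None or top_k >= len(feature_names):
        return list(feature_names)  # null -> keep all candidates
    # Rank indices by descending importance; ties keep the earlier schema position.
    ranked = sorted(range(len(feature_names)), key=lambda i: (-importances[i], i))
    keep = set(ranked[:top_k])
    return [name for i, name in enumerate(feature_names) if i in keep]


names = ["f0", "f1", "f2", "f3"]
scores = [10.0, 0.0, 25.0, 10.0]
print(select_top_k(names, scores, 2))     # prints ['f0', 'f2']
print(select_top_k(names, scores, None))  # prints ['f0', 'f1', 'f2', 'f3']
```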

Idempotency

If output_schema_file exists and its sidecar (output_schema_file.idempotent_meta.json) records the same config_hash as the current configuration, the pipeline exits without re-running and prints a message saying the run was skipped as idempotent.
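
A minimal sketch of such a check for local paths follows. The hashing scheme (SHA-256 over canonical JSON), the sidecar path (the output path with .idempotent_meta.json appended), and all function names are assumptions made for illustration; only the config_hash field name comes from the description above.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config):
    """Hash a config dict via canonical JSON (sorted keys); assumed scheme."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def meta_path(output_schema_file):
    """Assumed sidecar location: the output path plus a fixed suffix."""
    return Path(str(output_schema_file) + ".idempotent_meta.json")


def is_up_to_date(output_schema_file, config):
    """True when the output and a sidecar with a matching config_hash both exist."""
    output, meta = Path(output_schema_file), meta_path(output_schema_file)
    if not (output.is_file() and meta.is_file()):
        return False
    try:
        stored = json.loads(meta.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False  # unreadable sidecar -> re-run
    return isinstance(stored, dict) and stored.get("config_hash") == config_hash(config)


def write_meta(output_schema_file, config):
    """Record the hash of the config that produced the output."""
    meta_path(output_schema_file).write_text(
        json.dumps({"config_hash": config_hash(config)}), encoding="utf-8"
    )
```

Because the keys are sorted before hashing, reordering keys in the YAML does not trigger a re-run, while changing any value does.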

S3

S3 is fully supported: input_path and output_schema_file can both be S3 paths. The pipeline uses mlops.storage for listing, reading, and writing. S3 access requires s3fs, which is installed through the s3 extra (pip install ".[s3]").