
Feature Selection Pipeline

Selects the top-K features by LightGBM importance (binary classification) and writes a reduced feature schema JSON for use in training and Optuna.


Script

python scripts/run_feature_selection_pipeline.py --config configs/pipelines_config.yml

Config section: feature_selection_pipeline.


Purpose

  • Input: a Parquet dataset (input_path: a single file or a directory of Parquet files) and an optional features_schema_file (a JSON list of candidate feature names).
  • Output: a JSON file listing the selected feature names (output_schema_file, e.g. features_schema_rfe.json or a custom path). A .idempotent_meta.json sidecar stores the config hash used for idempotency.

Main configuration

| Key | Description |
| --- | --- |
| input_path | Parquet file or directory of Parquet files (features dataset). |
| features_schema_file | Optional schema JSON (list of columns); if missing, all numeric columns may be used. |
| output_schema_file | Output path for the reduced schema JSON. |
| label_column | Target column (e.g. label). |
| top_k_features | Number of features to keep (by importance). |
| test_size | Fraction for validation split. |
| random_state | Random seed. |
| pipeline_type | PFM or OBSERVER (for compatibility). |
| feature_columns | Optional base columns to restrict selection. |

Configuration reference (each item explained)

| Item | Meaning | What it solves | Notes |
| --- | --- | --- | --- |
| input_path | Path to the features Parquet (single file or directory of Parquet files). | Defines the dataset on which importance is computed; usually the same as used for training. | Same dataset as in features_pipeline output. |
| features_schema_file | Path to a JSON file listing all feature names (full schema). | Defines the candidate set of features and their order; selection chooses a subset of these. | Typically the full schema from the Features pipeline. |
| output_schema_file | Path where the reduced schema (selected feature names only) will be written. | Provides the schema for training and Optuna so they use the same subset of features. | Training pipelines reference this file in features_schema_file. |
| pipeline_type | "PFM" or "OBSERVER". | Selects which training section to use as reference (e.g. for default paths or behavior). | Must match the training pipelines you run later. |
| label_column | Target column: "label" (detection), "LEAK_SIZE_in", "LEAK_LOCATION_m", or "LEAK_FLOW_kg_s". | Determines the task (detection, size, location, leak flow) and thus the importance ranking. | Must exist in the features dataset. |
| feature_columns | Optional list of base signal names to restrict which derived features are considered. | Limits selection to features derived from these signals; reduces noise and keeps interpretation aligned with domain. | If omitted, all columns in the schema may be used. |
| top_k_features | Maximum number of features to keep (Top-K by importance). | Reduces dimensionality and overfitting; smaller K = simpler models, larger K = more capacity. | null = keep all (no reduction). |

Configuration template

Add this block to configs/pipelines_config.yml under the key feature_selection_pipeline:

feature_selection_pipeline:
  input_path: "data/features/SS/data/features_dataset.parquet"
  features_schema_file: "data/features/SS/metadata/features_schema.json"
  output_schema_file: "data/features/SS/metadata/features_schema_rfe_pfm_detection.json"

  pipeline_type: "PFM"   # or "OBSERVER"
  label_column: "label"  # or "LEAK_SIZE_in", "LEAK_LOCATION_m", "LEAK_FLOW_kg_s"

  feature_columns:
    - "PT 'POSITION:' 'POSITION_1378M' '(PA)' 'Pressure'"
    - "GT 'POSITION:' 'POSITION_1378M' '(KG/S)' 'Total mass flow'"
    # ... base columns to consider

  top_k_features: 32

See the full file pipelines_config.yml in the repository for the complete section.
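
Before launching the pipeline it can help to sanity-check the section. The sketch below is not part of the repository: the function name validate_feature_selection_config and the exact rules are assumptions derived from the tables on this page (allowed pipeline_type and label_column values, top_k_features being a positive integer or null). The real script may accept more or fewer values.

```python
# Hypothetical helper, not the script's own validation logic.
ALLOWED_PIPELINE_TYPES = {"PFM", "OBSERVER"}
ALLOWED_LABEL_COLUMNS = {"label", "LEAK_SIZE_in", "LEAK_LOCATION_m", "LEAK_FLOW_kg_s"}


def validate_feature_selection_config(section: dict) -> list[str]:
    """Return a list of problems; an empty list means the section looks sane."""
    problems = []

    pipeline_type = section.get("pipeline_type", "PFM")
    if pipeline_type not in ALLOWED_PIPELINE_TYPES:
        problems.append(f"pipeline_type must be PFM or OBSERVER, got {pipeline_type!r}")

    label_column = section.get("label_column", "label")
    if label_column not in ALLOWED_LABEL_COLUMNS:
        problems.append(f"unexpected label_column {label_column!r}")

    # null (None) is valid and means "keep all candidate features".
    top_k = section.get("top_k_features", 32)
    if top_k is not None:
        if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
            problems.append("top_k_features must be a positive integer or null")

    feature_columns = section.get("feature_columns")
    if feature_columns is not None:
        if not isinstance(feature_columns, list) or not all(
            isinstance(name, str) for name in feature_columns
        ):
            problems.append("feature_columns must be a list of strings")

    return problems
```

Calling it on the parsed YAML section and failing fast when the returned list is non-empty keeps configuration mistakes out of long-running jobs.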


Default values applied by the script

run_feature_selection_pipeline.py uses two layers of defaults:

  1. feature_selection_pipeline starts from an internal FEATURE_SELECTION_SECTION_DEFAULTS.
  2. Then the script resolves a base training section (training_pfm_* or training_observer_*) from pipeline_type and label_column, and inherits missing training-related keys from it.

Defaults owned by feature_selection_pipeline

| Key | Default |
| --- | --- |
| pipeline_type | "PFM" |
| label_column | "label" |
| input_path | null -> usually inherited from the resolved training section |
| features_schema_file | null -> usually inherited from the resolved training section |
| output_schema_file | null -> auto-derived from features_schema_file as *_rfe.json, or features_schema_rfe.json as fallback |
| top_k_features | 32 |
| feature_columns | null -> use all candidate features in the schema |
| random_state | 42 |
| test_size | 0.25 |
| shuffle | true |
| lgbm_params | {} -> no local override; base LightGBM params are inherited |
| num_boost_round | 1000 |
| early_stopping_rounds | 50 |
| min_improvement | 0.0 |
| verbose_eval | 50 |
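
The auto-derivation of output_schema_file can be pictured as follows. This is a minimal sketch under stated assumptions: the function name derive_output_schema_file is hypothetical, and the rule "strip .json, append _rfe.json" is one reading of the *_rfe.json pattern above. The script's exact behaviour may differ. Plain string handling is used so that S3-style paths keep their s3:// prefix.

```python
def derive_output_schema_file(features_schema_file):
    """Guess the reduced-schema path from the full-schema path (illustrative only)."""
    if not features_schema_file:
        # Fallback name mentioned in the defaults table.
        return "features_schema_rfe.json"
    suffix = ".json"
    if features_schema_file.endswith(suffix):
        base = features_schema_file[: -len(suffix)]
    else:
        base = features_schema_file
    return f"{base}_rfe.json"
```

For example, under this assumption data/features/SS/metadata/features_schema.json maps to data/features/SS/metadata/features_schema_rfe.json.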

Inherited defaults from the resolved training section

Once the script maps the task to its source training section, it inherits defaults such as:

  • lgbm_params from the corresponding training script,
  • input_path and features_schema_file if they were omitted locally,
  • task-specific defaults like deviation_percentage, filter_zero_flow, filter_zero_size, filter_zero_location, or min_label when they exist.

The script also forces robustness.enabled = false during feature selection, even if the base training section enables robustness, so importance is computed on clean data.
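
The two layers of defaults and the robustness override can be sketched roughly as below. The constant name FEATURE_SELECTION_SECTION_DEFAULTS and its values come from this page; the function resolve_feature_selection_config and the precedence rule (explicit user values first, then values inherited from the training section for keys that are still null or empty, then section defaults) are assumptions for illustration and may not match the script exactly. How the training section is looked up from pipeline_type and label_column is left out.

```python
import copy

FEATURE_SELECTION_SECTION_DEFAULTS = {
    "pipeline_type": "PFM",
    "label_column": "label",
    "input_path": None,
    "features_schema_file": None,
    "output_schema_file": None,
    "top_k_features": 32,
    "feature_columns": None,
    "random_state": 42,
    "test_size": 0.25,
    "shuffle": True,
    "lgbm_params": {},
    "num_boost_round": 1000,
    "early_stopping_rounds": 50,
    "min_improvement": 0.0,
    "verbose_eval": 50,
}


def resolve_feature_selection_config(user_section, training_section):
    """Merge section defaults, user values, and inherited training values (sketch)."""
    resolved = copy.deepcopy(FEATURE_SELECTION_SECTION_DEFAULTS)
    # Layer 1: explicit user values always win, including an explicit null.
    resolved.update(copy.deepcopy(user_section))
    # Layer 2: inherit from the training section only where the user gave
    # nothing and the current value is still null or an empty dict.
    for key, value in training_section.items():
        if key not in user_section and resolved.get(key) in (None, {}):
            resolved[key] = copy.deepcopy(value)
    # Robustness is always disabled so importance is computed on clean data.
    robustness = dict(resolved.get("robustness") or {})
    robustness["enabled"] = False
    resolved["robustness"] = robustness
    return resolved
```

Deep copies keep the training section untouched, so later training runs still see their own robustness settings.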

Special case: if you set top_k_features: null explicitly, the script keeps all candidate features instead of applying a Top-K cutoff.
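
The Top-K cutoff, including the null case, amounts to the following. This is an illustrative sketch: select_top_k is a hypothetical name, the importances are assumed to be already available as a name-to-score mapping (for example from a fitted LightGBM model), and breaking ties alphabetically is an assumption made only to keep the result deterministic.

```python
def select_top_k(importances, top_k):
    """Rank features by descending importance and keep the first top_k.

    top_k=None mirrors `top_k_features: null`: every candidate is kept.
    """
    ranked = sorted(importances, key=lambda name: (-importances[name], name))
    if top_k is None:
        return ranked
    return ranked[:top_k]
```

The returned list is what would be serialized to the reduced schema JSON. Whether the real script preserves importance order or the original schema order is not stated here.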


Idempotency

  • If output_schema_file exists and its sidecar output_schema_file.idempotent_meta.json holds the same config_hash as the current configuration, the pipeline exits without re-running and prints a message saying the run was skipped as idempotent.
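
A check of this kind could look like the sketch below. Several details are assumptions rather than facts about the script: the hash is taken as SHA-256 over the section serialized as sorted JSON, the sidecar name is formed by appending .idempotent_meta.json to the output path, and only the local filesystem is handled (the real pipeline goes through mlops.storage). All function names are hypothetical.

```python
import hashlib
import json
import os


def config_hash(section):
    """Stable hash of the resolved config section (assumed: SHA-256 of sorted JSON)."""
    payload = json.dumps(section, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def meta_path(output_schema_file):
    # Assumed naming: sidecar sits next to the output file.
    return output_schema_file + ".idempotent_meta.json"


def is_up_to_date(output_schema_file, section):
    """True when both files exist and the stored hash matches the current config."""
    sidecar = meta_path(output_schema_file)
    if not (os.path.exists(output_schema_file) and os.path.exists(sidecar)):
        return False
    try:
        with open(sidecar, encoding="utf-8") as handle:
            meta = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(meta, dict) and meta.get("config_hash") == config_hash(section)


def write_meta(output_schema_file, section):
    """Record the hash after a successful run."""
    with open(meta_path(output_schema_file), "w", encoding="utf-8") as handle:
        json.dump({"config_hash": config_hash(section)}, handle)
```

Changing any config value, such as top_k_features, changes the hash, so the next run recomputes the selection instead of exiting early.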

S3

  • Full S3 support: input_path and output_schema_file can be S3 paths. The pipeline uses mlops.storage for listing, reading, and writing. This requires s3fs; install it with pip install .[s3].