Feature Selection Pipeline
Selects the top-K features by LightGBM importance (binary classification) and writes a reduced feature schema JSON for use in training and Optuna.
Script
Script: `run_feature_selection_pipeline.py`. Config section: `feature_selection_pipeline`.
Purpose
- Input: A Parquet dataset (e.g. `input_path`: single file or directory of Parquet) and optional `features_schema_file` (list of feature names to consider).
- Output: A JSON file listing the selected feature names (e.g. `output_schema_file`: `features_schema_rfe.json` or custom path). A `.idempotent_meta.json` sidecar stores the config hash for idempotency.
Main configuration
| Key | Description |
|---|---|
| `input_path` | Parquet file or directory of Parquet files (features dataset). |
| `features_schema_file` | Optional schema JSON (list of columns); if missing, all numeric columns may be used. |
| `output_schema_file` | Output path for the reduced schema JSON. |
| `label_column` | Target column (e.g. `label`). |
| `top_k_features` | Number of features to keep (by importance). |
| `test_size` | Fraction for validation split. |
| `random_state` | Random seed. |
| `pipeline_type` | `PFM` or `OBSERVER` (for compatibility). |
| `feature_columns` | Optional base columns to restrict selection. |
Configuration reference (each item explained)
| Item | Meaning | What it solves | Notes |
|---|---|---|---|
| `input_path` | Path to the features Parquet (single file or directory of Parquet files). | Defines the dataset on which importance is computed; usually the same as used for training. | Same dataset as in `features_pipeline` output. |
| `features_schema_file` | Path to a JSON file listing all feature names (full schema). | Defines the candidate set of features and their order; selection chooses a subset of these. | Typically the full schema from the Features pipeline. |
| `output_schema_file` | Path where the reduced schema (selected feature names only) will be written. | Provides the schema for training and Optuna so they use the same subset of features. | Training pipelines reference this file in `features_schema_file`. |
| `pipeline_type` | `"PFM"` or `"OBSERVER"`. | Selects which training section to use as reference (e.g. for default paths or behavior). | Must match the training pipelines you run later. |
| `label_column` | Target column: `"label"` (detection), `"LEAK_SIZE_in"`, `"LEAK_LOCATION_m"`, or `"LEAK_FLOW_kg_s"`. | Determines the task (detection, size, location, leak flow) and thus the importance ranking. | Must exist in the features dataset. |
| `feature_columns` | Optional list of base signal names to restrict which derived features are considered. | Limits selection to features derived from these signals; reduces noise and keeps interpretation aligned with domain. | If omitted, all columns in the schema may be used. |
| `top_k_features` | Maximum number of features to keep (Top-K by importance). | Reduces dimensionality and overfitting; smaller K = simpler models, larger K = more capacity. | `null` = keep all (no reduction). |
Configuration template
Add this block to `configs/pipelines_config.yml` under the key `feature_selection_pipeline`:

    feature_selection_pipeline:
      input_path: "data/features/SS/data/features_dataset.parquet"
      features_schema_file: "data/features/SS/metadata/features_schema.json"
      output_schema_file: "data/features/SS/metadata/features_schema_rfe_pfm_detection.json"
      pipeline_type: "PFM"  # or "OBSERVER"
      label_column: "label"  # or "LEAK_SIZE_in", "LEAK_LOCATION_m", "LEAK_FLOW_kg_s"
      feature_columns:
        - "PT 'POSITION:' 'POSITION_1378M' '(PA)' 'Pressure'"
        - "GT 'POSITION:' 'POSITION_1378M' '(KG/S)' 'Total mass flow'"
        # ... base columns to consider
      top_k_features: 32

See the full file pipelines_config.yml in the repository for the complete section.
Default values applied by the script
`run_feature_selection_pipeline.py` uses two layers of defaults:
- `feature_selection_pipeline` starts from an internal `FEATURE_SELECTION_SECTION_DEFAULTS`.
- Then the script resolves a base training section (`training_pfm_*` or `training_observer_*`) from `pipeline_type` and `label_column`, and inherits missing training-related keys from it.
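The two layers above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `SECTION_DEFAULTS` is a hypothetical stand-in for `FEATURE_SELECTION_SECTION_DEFAULTS` (only a subset of keys is shown), `resolve_config` and `INHERITABLE_KEYS` are made-up names, and the key-wise merge of `lgbm_params` is an assumption. How the training section name is derived from `pipeline_type` and `label_column` is not shown here, so the sketch takes the already-resolved section as an argument.

```python
from copy import deepcopy

# Hypothetical stand-in for the internal FEATURE_SELECTION_SECTION_DEFAULTS
# (subset only; values follow the defaults documented on this page).
SECTION_DEFAULTS = {
    "pipeline_type": "PFM",
    "label_column": "label",
    "input_path": None,
    "features_schema_file": None,
    "top_k_features": 32,
    "test_size": 0.25,
    "random_state": 42,
    "lgbm_params": {},
}

# Assumed list of keys that fall back to the training section when unset.
INHERITABLE_KEYS = ("input_path", "features_schema_file")


def resolve_config(user_section, training_section):
    """Merge internal defaults, the user's section, and the base training section."""
    # Layer 1: internal defaults, overridden by whatever the user set explicitly
    # (an explicit `top_k_features: null` therefore survives as None).
    resolved = deepcopy(SECTION_DEFAULTS)
    resolved.update(deepcopy(user_section))

    # Layer 2: fill keys that are still unset from the resolved training section.
    for key in INHERITABLE_KEYS:
        if resolved.get(key) is None and training_section.get(key) is not None:
            resolved[key] = training_section[key]

    # Assumption: local lgbm_params override the inherited ones key by key.
    base_params = training_section.get("lgbm_params") or {}
    local_params = resolved.get("lgbm_params") or {}
    resolved["lgbm_params"] = {**base_params, **local_params}
    return resolved
```

With an empty user section, `input_path`, `features_schema_file`, and `lgbm_params` all come from the training section, while `top_k_features` stays at 32.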
Defaults owned by feature_selection_pipeline
| Key | Default |
|---|---|
| `pipeline_type` | `"PFM"` |
| `label_column` | `"label"` |
| `input_path` | `null` -> usually inherited from the resolved training section |
| `features_schema_file` | `null` -> usually inherited from the resolved training section |
| `output_schema_file` | `null` -> auto-derived from `features_schema_file` as `*_rfe.json`, or `features_schema_rfe.json` as fallback |
| `top_k_features` | `32` |
| `feature_columns` | `null` -> use all candidate features in the schema |
| `random_state` | `42` |
| `test_size` | `0.25` |
| `shuffle` | `true` |
| `lgbm_params` | `{}` -> no local override; base LightGBM params are inherited |
| `num_boost_round` | `1000` |
| `early_stopping_rounds` | `50` |
| `min_improvement` | `0.0` |
| `verbose_eval` | `50` |
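As an illustration of the `output_schema_file` default in the table above, the auto-derivation could look like the sketch below. The function name `derive_output_schema_file` is hypothetical, and two details are assumptions: that the `_rfe` suffix is inserted before the extension, and that the fallback is the bare file name.

```python
import posixpath


def derive_output_schema_file(features_schema_file):
    """Derive a `*_rfe.json` path next to the full schema (illustrative only)."""
    if not features_schema_file:
        # Fallback name documented in the defaults table; its directory is not
        # specified there, so the bare file name is returned here.
        return "features_schema_rfe.json"
    # posixpath keeps forward slashes intact, so S3-style paths are preserved.
    stem, _ext = posixpath.splitext(features_schema_file)
    return f"{stem}_rfe.json"
```

For example, `data/features/SS/metadata/features_schema.json` would map to `data/features/SS/metadata/features_schema_rfe.json` under these assumptions.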
Inherited defaults from the resolved training section
Once the script maps the task to its source training section, it inherits defaults such as:
- `lgbm_params` from the corresponding training script,
- `input_path` and `features_schema_file` if they were omitted locally,
- task-specific defaults like `deviation_percentage`, `filter_zero_flow`, `filter_zero_size`, `filter_zero_location`, or `min_label` when they exist.
The script also forces robustness.enabled = false during feature selection, even if the
base training section enables robustness, so importance is computed on clean data.
Special case: if you set top_k_features: null explicitly, the script keeps all candidate
features instead of applying a Top-K cutoff.
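The Top-K cutoff and its `null` special case can be sketched in a few lines. This is an illustration under assumptions, not the script's code: `select_top_k` is a hypothetical name, ties are broken alphabetically for determinism, and the importance scores are taken as a plain dict. With LightGBM, such a dict could be built from `Booster.feature_name()` and `Booster.feature_importance(...)`; which importance type the script uses is not stated on this page.

```python
def select_top_k(importances, top_k):
    """Return feature names to keep, given a name -> importance mapping."""
    if top_k is None:
        # `top_k_features: null` keeps every candidate, in the original order.
        return list(importances)
    if top_k < 0:
        raise ValueError("top_k must be non-negative or None")
    # Highest importance first; ties broken by name so the output is stable.
    ranked = sorted(importances, key=lambda name: (-importances[name], name))
    return ranked[:top_k]
```

If `top_k` exceeds the number of candidates, the slice simply returns all of them in ranked order.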
Idempotency
- If `output_schema_file` exists and `output_schema_file.idempotent_meta.json` exists with the same `config_hash`, the pipeline exits without re-running and prints an idempotent message.
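A minimal local-filesystem sketch of this check is shown below. The helper names (`config_hash`, `sidecar_path`, `is_up_to_date`, `write_outputs`) are hypothetical, and several details are assumptions: SHA-256 over canonical JSON as the hash, a sidecar holding only `config_hash`, and a schema stored as a flat JSON list. The real pipeline goes through `mlops.storage`, which is not reproduced here.

```python
import hashlib
import json
import os


def config_hash(config):
    """Hash the config as canonical JSON so key order does not matter."""
    payload = json.dumps(config, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def sidecar_path(output_schema_file):
    # Assumed naming: the sidecar sits next to the schema with a fixed suffix.
    return output_schema_file + ".idempotent_meta.json"


def is_up_to_date(output_schema_file, config):
    """True when both files exist and the stored hash matches the current config."""
    meta_file = sidecar_path(output_schema_file)
    if not (os.path.exists(output_schema_file) and os.path.exists(meta_file)):
        return False
    with open(meta_file, encoding="utf-8") as fh:
        meta = json.load(fh)
    return meta.get("config_hash") == config_hash(config)


def write_outputs(output_schema_file, selected_features, config):
    """Write the reduced schema and the sidecar recording the config hash."""
    with open(output_schema_file, "w", encoding="utf-8") as fh:
        json.dump(selected_features, fh, indent=2)
    with open(sidecar_path(output_schema_file), "w", encoding="utf-8") as fh:
        json.dump({"config_hash": config_hash(config)}, fh)
```

A caller would test `is_up_to_date(...)` first and skip the run when it returns `True`; changing any config value changes the hash and triggers a re-run.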
S3
- Full S3 support: `input_path` and `output_schema_file` can be S3 paths. The pipeline uses `mlops.storage` for listing, reading, and writing. Install `s3fs` (`pip install .[s3]`).