Preprocessor YAML (WIP)

YAML configuration file format for the Preprocessor module, used for data preprocessing.

Usage Examples

Using Default Preprocessing

Preprocessor:
  demo:
    method: 'default'

Using Custom Processing Sequence

Preprocessor:
  custom:
    method: 'default'
    sequence:
      - missing
      - outlier
      - encoder
      - scaler

Customizing Processors for Specific Fields

Preprocessor:
  custom_fields:
    method: 'default'
    config:
      missing:
        age: 'missing_mean'
        income: 'missing_median'
      outlier:
        age: 'outlier_zscore'
        income: 'outlier_iqr'
      encoder:
        gender: 'encoder_onehot'
        education: 'encoder_label'
      scaler:
        age: 'scaler_minmax'
        income: 'scaler_standard'

Main Parameters

  • method (string, required)

    • Preprocessing method
    • Available values: 'default' (default processing sequence)
  • sequence (list, optional)

    • Custom processing sequence
    • Available values: 'missing', 'outlier', 'encoder', 'scaler', 'discretizing'
    • Default value: ['missing', 'outlier', 'encoder', 'scaler']
  • config (dict, optional)

    • Custom processor configuration for each field
    • Structure: {processing_type: {field_name: processing_method}}

Processing Sequence

Preprocessor supports the following processing steps, executed in the order listed in sequence:

  1. missing: Missing value handling
  2. outlier: Outlier handling
  3. encoder: Categorical variable encoding
  4. scaler: Numerical normalization
  5. discretizing: Discretization (mutually exclusive with encoder)
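
As a minimal sketch of the mutual exclusivity noted in step 5, the configuration below uses discretizing in place of encoder and puts it at the end of the sequence. The experiment name with_discretizing is hypothetical and is not part of the module.

```yaml
Preprocessor:
  with_discretizing:        # hypothetical experiment name
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing        # used instead of encoder, and placed last
```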

Default Processing Methods

Processing Type   Numerical            Categorical       Datetime
missing           missing_mean         missing_drop      missing_drop
outlier           outlier_iqr          None              outlier_iqr
encoder           None                 encoder_uniform   None
scaler            scaler_standard      None              scaler_standard
discretizing      discretizing_kbins   encoder_label     discretizing_kbins
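
To illustrate how these defaults interact with a custom config, the sketch below overrides only the missing-value handler for one numerical field. The experiment name override_missing and the field name income are hypothetical, and the sketch assumes that any field or processing type not listed under config keeps the default from the table above.

```yaml
Preprocessor:
  override_missing:          # hypothetical experiment name
    method: 'default'
    config:
      missing:
        income: 'missing_median'   # replaces the numerical default, missing_mean
```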

Feature Documentation

For detailed descriptions of each processor, refer to the corresponding feature page.

Precision Preservation

Preprocessor automatically preserves the precision of numerical fields:

  • Precision Retention: The type_attr.precision value in the schema is left unchanged during transformation
  • Automatic Application: After transformation, values are automatically rounded to that precision
  • Memory Mechanism: Precision information is recorded in Status so that subsequent modules can use it

Execution Instructions

  • Experiment names (the second level of the YAML) can be chosen freely; descriptive names are recommended
  • Multiple experiments can be defined and will be executed sequentially
  • Preprocessing results are passed to the Synthesizer module
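
A minimal sketch of defining several experiments in one file might look like the following. Both experiment names are hypothetical. Shortening the sequence to drop scaler is an assumption based on the list of available sequence values.

```yaml
Preprocessor:
  default_pipeline:          # hypothetical experiment name
    method: 'default'
  no_scaling:                # hypothetical experiment name
    method: 'default'
    sequence:                # assumed: a sequence may omit steps such as scaler
      - missing
      - outlier
      - encoder
```

Per the instructions above, the two experiments would run sequentially, in this case default_pipeline followed by no_scaling.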

Notes

  • discretizing and encoder cannot be used simultaneously
  • discretizing must be the last step in the sequence
  • Some outlier handlers (such as outlier_isolationforest, outlier_lof) are global transformations and will be applied to all fields
  • Custom config will override default settings
  • Precision is automatically applied after preprocessing transformation to ensure numerical consistency
  • For detailed processor parameter settings, please refer to each feature page and the Processor API documentation
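
As a sketch of the note on global outlier handlers, the configuration below assigns outlier_isolationforest to a single field. It assumes global handlers are selected through the same per-field config structure as other processors. The experiment name global_outlier and the field name age are hypothetical.

```yaml
Preprocessor:
  global_outlier:            # hypothetical experiment name
    method: 'default'
    config:
      outlier:
        age: 'outlier_isolationforest'   # global handler: applied to all fields, not only age
```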