Preprocessor YAML
YAML configuration file format for the Preprocessor module, used for data preprocessing.
Usage Examples
Note: If using Colab, please see the runtime setup guide.
Using Default Preprocessing

    Preprocessor:
      demo:
        method: 'default'

Using Custom Processing Sequence

    Preprocessor:
      custom:
        method: 'default'
        sequence:
          - missing
          - outlier
          - encoder
          - scaler

Customizing Processors for Specific Fields

    Preprocessor:
      custom_fields:
        method: 'default'
        config:
          missing:
            age: 'missing_mean'
            income: 'missing_median'
          outlier:
            age: 'outlier_zscore'
            income: 'outlier_iqr'
          encoder:
            gender: 'encoder_onehot'
            education: 'encoder_label'
          scaler:
            age: 'scaler_minmax'
            income: 'scaler_standard'

Main Parameters
- method (string, required)
  - Preprocessing method
  - Available values: 'default' (default processing sequence)
- sequence (list, optional)
  - Custom processing sequence
  - Available values: 'missing', 'outlier', 'encoder', 'scaler', 'discretizing'
  - Default value: ['missing', 'outlier', 'encoder', 'scaler']
- config (dict, optional)
  - Custom processor configuration for each field
  - Structure: {processing_type: {field_name: processing_method}}
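
As a sketch of how the three parameters nest under one experiment, the block below combines a custom sequence with a per-field config. The experiment name combined and the field names are placeholders. Using sequence and config together in one experiment is an assumption drawn from the parameter descriptions above, not a documented example.

```yaml
Preprocessor:
  combined:                       # experiment name (placeholder)
    method: 'default'             # required
    sequence:                     # optional: steps to run, in order
      - missing
      - outlier
      - encoder
    config:                       # optional: {processing_type: {field_name: processing_method}}
      missing:
        income: 'missing_median'  # placeholder field name
      encoder:
        gender: 'encoder_onehot'  # placeholder field name
```

Fields not listed under config would presumably keep the defaults for their data type.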
Processing Sequence
Preprocessor supports the following processing steps, executed in order:
- missing: Missing value handling
- outlier: Outlier handling
- encoder: Categorical variable encoding
- scaler: Numerical normalization
- discretizing: Discretization (mutually exclusive with encoder)
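
Because discretizing and encoder are mutually exclusive, a sequence that discretizes has to leave encoder out. A minimal sketch, assuming the experiment name discretize is free to choose and that discretizing goes at the end of the list:

```yaml
Preprocessor:
  discretize:            # experiment name (placeholder)
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing     # replaces encoder; kept as the final step
```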
Default Processing Methods
| Processing Type | Numerical | Categorical | Datetime |
|---|---|---|---|
| missing | missing_mean | missing_drop | missing_drop |
| outlier | outlier_iqr | None | outlier_iqr |
| encoder | None | encoder_uniform | None |
| scaler | scaler_standard | None | scaler_standard |
| discretizing | discretizing_kbins | encoder_label | discretizing_kbins |
Feature Documentation
For detailed processor descriptions, please refer to each feature page.
Precision Preservation
Preprocessor automatically preserves the precision of numerical fields:
- Precision Retention: The type_attr.precision in the schema will not be changed during transformation
- Automatic Application: Rounding is automatically applied according to precision after transformation
- Memory Mechanism: Precision information is recorded in Status for use by subsequent modules
Execution Instructions
- Experiment names (the second level of the YAML, directly under Preprocessor) can be chosen freely; descriptive names are recommended
- Multiple experiments can be defined and will be executed sequentially
- Preprocessing results are passed to the Synthesizer module
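
To illustrate the points above, the following sketch defines two experiments under one Preprocessor block. Both experiment names are placeholders chosen for this example. According to the description above, the experiments run sequentially.

```yaml
Preprocessor:
  default_steps:         # first experiment: default sequence
    method: 'default'
  without_scaler:        # second experiment: custom sequence, no scaling
    method: 'default'
    sequence:
      - missing
      - outlier
      - encoder
```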
Notes
- discretizing and encoder cannot be used simultaneously
- discretizing must be the last step in the sequence
- Some outlier handlers (such as outlier_isolationforest, outlier_lof) are global transformations and will be applied to all fields
- Custom config will override default settings
- Precision is automatically applied after preprocessing transformation to ensure numerical consistency
- For detailed processor parameter settings, please refer to each feature page and the Processor API documentation