Preprocessor YAML (WIP)
YAML configuration file format for the Preprocessor module, used for data preprocessing.
Usage Examples
Using Default Preprocessing
Preprocessor:
  demo:
    method: 'default'
Using Custom Processing Sequence
Preprocessor:
  custom:
    method: 'default'
    sequence:
      - missing
      - outlier
      - encoder
      - scaler
Customizing Processors for Specific Fields
Preprocessor:
  custom_fields:
    method: 'default'
    config:
      missing:
        age: 'missing_mean'
        income: 'missing_median'
      outlier:
        age: 'outlier_zscore'
        income: 'outlier_iqr'
      encoder:
        gender: 'encoder_onehot'
        education: 'encoder_label'
      scaler:
        age: 'scaler_minmax'
        income: 'scaler_standard'
Main Parameters
- method (string, required): Preprocessing method
  - Available values: 'default' (default processing sequence)
- sequence (list, optional): Custom processing sequence
  - Available values: 'missing', 'outlier', 'encoder', 'scaler', 'discretizing'
  - Default value: ['missing', 'outlier', 'encoder', 'scaler']
- config (dict, optional): Custom processor configuration for each field
  - Structure: {processing_type: {field_name: processing_method}}
Processing Sequence
Preprocessor supports the following processing steps, executed in order:
- missing: Missing value handling
- outlier: Outlier handling
- encoder: Categorical variable encoding
- scaler: Numerical normalization
- discretizing: Discretization (mutually exclusive with encoder)
Default Processing Methods
Processing Type | Numerical | Categorical | Datetime |
---|---|---|---|
missing | missing_mean | missing_drop | missing_drop |
outlier | outlier_iqr | None | outlier_iqr |
encoder | None | encoder_uniform | None |
scaler | scaler_standard | None | scaler_standard |
discretizing | discretizing_kbins | encoder_label | discretizing_kbins |
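
The defaults in the table, combined with the config structure {processing_type: {field_name: processing_method}}, can be pictured as a lookup with a per-field override. The Python sketch below is illustrative only: DEFAULT_METHODS, resolve_method, and the type labels 'numerical', 'categorical', and 'datetime' are hypothetical names that mirror the table, not part of the Preprocessor API.

```python
from typing import Optional

# Defaults copied from the table; None means no processor is applied by default.
DEFAULT_METHODS = {
    "missing": {"numerical": "missing_mean", "categorical": "missing_drop", "datetime": "missing_drop"},
    "outlier": {"numerical": "outlier_iqr", "categorical": None, "datetime": "outlier_iqr"},
    "encoder": {"numerical": None, "categorical": "encoder_uniform", "datetime": None},
    "scaler": {"numerical": "scaler_standard", "categorical": None, "datetime": "scaler_standard"},
    "discretizing": {"numerical": "discretizing_kbins", "categorical": "encoder_label", "datetime": "discretizing_kbins"},
}


def resolve_method(step: str, field: str, field_type: str,
                   config: Optional[dict] = None) -> Optional[str]:
    """Return the custom method configured for a field, else the table default."""
    custom = (config or {}).get(step, {})
    if field in custom:
        return custom[field]  # a custom entry takes priority over the default
    return DEFAULT_METHODS[step][field_type]


# Hypothetical custom config: only 'age' overrides its missing-value handler.
config = {"missing": {"age": "missing_median"}}
print(resolve_method("missing", "age", "numerical", config))     # custom override
print(resolve_method("missing", "income", "numerical", config))  # table default
```

Fields that are not named in config fall back to the default for their data type, which is why a config only needs to list the fields that differ.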
Feature Documentation
For detailed processor descriptions, please refer to each processor's feature page.
Precision Preservation
Preprocessor automatically preserves the precision of numerical fields:
- Precision Retention: The type_attr.precision in the schema will not be changed during transformation
- Automatic Application: Rounding is automatically applied according to precision after transformation
- Memory Mechanism: Precision information is recorded in Status for use by subsequent modules
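
As a rough picture of the rounding step, the sketch below rounds transformed values back to a recorded precision. It uses only Python's built-in round; the precision_by_field mapping and the apply_precision function are hypothetical stand-ins for whatever the schema and Status record internally.

```python
def apply_precision(row: dict, precision_by_field: dict) -> dict:
    """Round each float value to its field's recorded number of decimal places."""
    rounded = {}
    for field, value in row.items():
        precision = precision_by_field.get(field)
        if precision is not None and isinstance(value, float):
            rounded[field] = round(value, precision)
        else:
            rounded[field] = value  # fields without a recorded precision pass through
    return rounded


# Hypothetical precision recorded before preprocessing: income has 2 decimal places.
precision_by_field = {"income": 2}

# A value that picked up extra digits during a transformation round trip.
row = {"income": 1234.5600000001, "gender": "F"}
print(apply_precision(row, precision_by_field))
```

Rounding after the transformation, rather than before it, keeps floating-point noise introduced by scaling from leaking into the output.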
Execution Instructions
- Experiment names (the second level of the YAML, such as demo or custom in the examples) can be chosen freely; descriptive names are recommended
- Multiple experiments can be defined and will be executed sequentially
- Preprocessing results are passed to the Synthesizer module
Notes
- discretizing and encoder cannot be used simultaneously
- discretizing must be the last step in the sequence
- Some outlier handlers (such as outlier_isolationforest, outlier_lof) are global transformations and will be applied to all fields
- Custom config will override default settings
- Precision is automatically applied after preprocessing transformation to ensure numerical consistency
- For detailed processor parameter settings, please refer to each feature page and the Processor API documentation
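
The two sequence rules in these notes can be checked before a configuration is run. The following Python sketch is one possible pre-flight check, not the validation Preprocessor itself performs; ALLOWED_STEPS and validate_sequence are hypothetical names introduced for illustration.

```python
# Step names listed as available values for the sequence parameter.
ALLOWED_STEPS = {"missing", "outlier", "encoder", "scaler", "discretizing"}


def validate_sequence(sequence: list) -> list:
    """Return a list of rule violations for a proposed processing sequence."""
    errors = []
    unknown = [step for step in sequence if step not in ALLOWED_STEPS]
    if unknown:
        errors.append(f"unknown steps: {unknown}")
    if "discretizing" in sequence and "encoder" in sequence:
        errors.append("discretizing and encoder cannot be used together")
    if "discretizing" in sequence and sequence[-1] != "discretizing":
        errors.append("discretizing must be the last step")
    return errors


print(validate_sequence(["missing", "outlier", "encoder", "scaler"]))  # no violations
print(validate_sequence(["missing", "discretizing", "scaler"]))        # discretizing is not last
```

Returning all violations at once, instead of raising on the first one, lets a user fix a sequence in a single pass.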