# Processor API (WIP)

Data processing module supporting both preprocessing and postprocessing operations.
## Class Architecture

```mermaid
classDiagram
    class Processor {
        <<Main Class>>
        -_metadata: Schema
        -_config: dict
        -_sequence: list
        -_mediator: dict
        +__init__(metadata, config)
        +fit(data, sequence)
        +transform(data) DataFrame
        +inverse_transform(data) DataFrame
        +get_config(col) dict
        +update_config(config)
        +get_changes() DataFrame
    }
    class MissingHandler {
        <<Sub-processor>>
        +fit(data)
        +transform(data)
        +inverse_transform(data)
    }
    class OutlierHandler {
        <<Sub-processor>>
        +fit(data)
        +transform(data)
    }
    class Encoder {
        <<Sub-processor>>
        +fit(data)
        +transform(data)
        +inverse_transform(data)
    }
    class Scaler {
        <<Sub-processor>>
        +fit(data)
        +transform(data)
        +inverse_transform(data)
    }
    class Mediator {
        <<Coordinator>>
        +fit(data)
        +transform(data) DataFrame
        +inverse_transform(data) DataFrame
    }
    class Schema {
        <<Configuration>>
        +attributes: dict
        +metadata: dict
    }

    Processor *-- MissingHandler : uses
    Processor *-- OutlierHandler : uses
    Processor *-- Encoder : uses
    Processor *-- Scaler : uses
    Processor *-- Mediator : coordinates
    Processor ..> Schema : depends on

    note for Processor "Preprocessing: fit() + transform()<br/>Postprocessing: inverse_transform()"
    note for MissingHandler "missing_mean, missing_median<br/>missing_mode, missing_drop"
    note for OutlierHandler "outlier_zscore, outlier_iqr<br/>outlier_lof, outlier_isolationforest"
    note for Encoder "encoder_uniform, encoder_label<br/>encoder_onehot, encoder_date"
    note for Scaler "scaler_standard, scaler_minmax<br/>scaler_log, scaler_timeanchor"
```
Legend:
- Blue boxes: Main classes
- Orange boxes: Sub-processor classes
- Light purple boxes: Configuration and data classes
- `*--`: Composition relationship
- `..>`: Dependency relationship
## Basic Usage

### Preprocessing

```python
from petsard import Processor

# Create processor
processor = Processor(metadata=schema)

# Fit and transform data
processor.fit(data)
processed_data = processor.transform(data)
```

### Postprocessing

```python
# Use the same processor instance for postprocessing
restored_data = processor.inverse_transform(synthetic_data)
```
### Complete Workflow

```python
from petsard import Loader, Processor, Synthesizer

# 1. Load data
loader = Loader('data.csv', schema='schema.yaml')
data, schema = loader.load()

# 2. Preprocessing
processor = Processor(metadata=schema)
processor.fit(data)
processed_data = processor.transform(data)

# 3. Synthesize data
synthesizer = Synthesizer(method='default')
synthesizer.create(metadata=schema)
synthesizer.fit_sample(processed_data)
synthetic_data = synthesizer.data_syn

# 4. Postprocessing (restoration)
restored_data = processor.inverse_transform(synthetic_data)
```
## Constructor (`__init__`)

Initialize a data processor instance.

### Syntax

```python
def __init__(
    metadata: Schema,
    config: dict = None
)
```
### Parameters

- `metadata` : `Schema`, required
  - Data structure definition (`Schema` object)
  - Provides metadata and type information for data fields
- `config` : `dict`, optional
  - Custom data processing configuration
  - Default: `None`
  - Used to override default processing procedures
  - Structure: `{processor_type: {field_name: processing_method}}`

### Returns

- `Processor`: Initialized processor instance
### Usage Examples

```python
from petsard import Loader, Processor

# Load data and schema
loader = Loader('data.csv', schema='schema.yaml')
data, schema = loader.load()

# Basic usage - use default configuration
processor = Processor(metadata=schema)

# Use custom configuration
custom_config = {
    'missing': {
        'age': 'missing_mean',
        'income': 'missing_median'
    },
    'outlier': {
        'age': 'outlier_zscore',
        'income': 'outlier_iqr'
    },
    'encoder': {
        'gender': 'encoder_onehot',
        'education': 'encoder_label'
    },
    'scaler': {
        'age': 'scaler_minmax',
        'income': 'scaler_standard'
    }
}
processor = Processor(metadata=schema, config=custom_config)
```
## Processing Sequence

Processor supports the following processing steps:

- `missing`: Missing value handling
- `outlier`: Outlier detection and handling
- `encoder`: Categorical variable encoding
- `scaler`: Numerical normalization
- `discretizing`: Discretization (mutually exclusive with `encoder`)

Default sequence: `['missing', 'outlier', 'encoder', 'scaler']`
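Because the steps form an ordered pipeline, postprocessing must undo the invertible steps in reverse order. The following is a minimal pure-Python sketch of that reverse-order requirement (illustrative only, not petsard's implementation):

```python
# Minimal sketch (not petsard's implementation): a pipeline applies its
# steps in sequence order, so the inverse must undo them in reverse order.

def make_pipeline(steps):
    """steps: list of (transform, inverse_transform) function pairs."""
    def transform(x):
        for fwd, _ in steps:
            x = fwd(x)
        return x

    def inverse_transform(x):
        for _, inv in reversed(steps):  # reverse order is essential
            x = inv(x)
        return x

    return transform, inverse_transform

# Toy "encoder" (shift) followed by a toy "scaler" (scale)
steps = [
    (lambda x: x + 10, lambda x: x - 10),  # encoder-like step
    (lambda x: x * 2, lambda x: x / 2),    # scaler-like step
]
transform, inverse_transform = make_pipeline(steps)

assert transform(5) == 30           # (5 + 10) * 2
assert inverse_transform(30) == 5   # round trip restores the original
```

Undoing the steps in the original (forward) order would subtract 10 before halving and fail to recover the input, which is why a single fitted Processor instance must own both directions of the pipeline.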
## Default Processing Methods

| Processor Type | Numerical | Categorical | Datetime |
|---|---|---|---|
| missing | MissingMean | MissingDrop | MissingDrop |
| outlier | OutlierIQR | None | OutlierIQR |
| encoder | None | EncoderUniform | None |
| scaler | ScalerStandard | None | ScalerStandard |
| discretizing | DiscretizingKBins | EncoderLabel | DiscretizingKBins |
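Conceptually, the table behaves like a lookup keyed by processor type and inferred column type. A hypothetical sketch of that dispatch (petsard's internal logic may differ):

```python
# Illustrative lookup implied by the defaults table above
# (not petsard's actual dispatch code).

DEFAULTS = {
    "missing": {"numerical": "missing_mean", "categorical": "missing_drop",
                "datetime": "missing_drop"},
    "outlier": {"numerical": "outlier_iqr", "categorical": None,
                "datetime": "outlier_iqr"},
    "encoder": {"numerical": None, "categorical": "encoder_uniform",
                "datetime": None},
    "scaler": {"numerical": "scaler_standard", "categorical": None,
               "datetime": "scaler_standard"},
    "discretizing": {"numerical": "discretizing_kbins",
                     "categorical": "encoder_label",
                     "datetime": "discretizing_kbins"},
}

def default_method(processor_type, column_type):
    """Return the default processing method, or None if the step is skipped."""
    return DEFAULTS[processor_type][column_type]

assert default_method("missing", "numerical") == "missing_mean"
assert default_method("encoder", "categorical") == "encoder_uniform"
```

Passing a `config` dict to the constructor overrides these per-field defaults.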
## Preprocessing vs Postprocessing

| Operation | Preprocessing Method | Postprocessing Method | Description |
|---|---|---|---|
| Training | fit() | - | Learn data statistics |
| Transform | transform() | - | Apply preprocessing transformations |
| Restore | - | inverse_transform() | Apply postprocessing restoration |

Note: Preprocessing and postprocessing use the same Processor instance to ensure transformation consistency.
## Available Processors

### Missing Value Processors

- `missing_mean`: Fill with mean value
- `missing_median`: Fill with median value
- `missing_mode`: Fill with mode value
- `missing_simple`: Fill with specified value
- `missing_drop`: Drop rows with missing values
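To illustrate the `fit`/`transform` contract these sub-processors follow, here is a pure-Python sketch of `missing_mean` semantics (illustrative, not petsard's code):

```python
# Sketch of missing_mean semantics (illustrative): fit() learns the column
# mean from observed values; transform() fills NaNs with that mean.
import math

class MissingMean:
    def fit(self, values):
        observed = [v for v in values if not math.isnan(v)]
        self.mean_ = sum(observed) / len(observed)

    def transform(self, values):
        return [self.mean_ if math.isnan(v) else v for v in values]

handler = MissingMean()
handler.fit([1.0, float("nan"), 3.0])        # mean of observed values is 2.0
assert handler.transform([float("nan"), 2.0]) == [2.0, 2.0]
```

Note that the mean is learned once at `fit()` time, so later `transform()` calls on new data reuse the same fill value.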
### Outlier Processors

- `outlier_zscore`: Z-Score method (threshold 3)
- `outlier_iqr`: Interquartile Range method (1.5 IQR)
- `outlier_isolationforest`: Isolation Forest algorithm
- `outlier_lof`: Local Outlier Factor algorithm
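The 1.5 IQR rule behind `outlier_iqr` can be sketched in plain Python (illustrative only; how petsard handles the flagged rows may differ):

```python
# Sketch of the 1.5 * IQR rule (illustrative): values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 100]  # 100 is a clear outlier
assert iqr_outliers(data) == [100]
```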
### Encoders

- `encoder_uniform`: Uniform encoding (allocate range by frequency)
- `encoder_label`: Label encoding (integer mapping)
- `encoder_onehot`: One-Hot encoding
- `encoder_date`: Date format conversion
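The "allocate range by frequency" idea behind `encoder_uniform` can be sketched as follows (an illustrative reconstruction, not petsard's code): each category owns a sub-interval of [0, 1) sized by its frequency, and `inverse_transform` maps a number back to the category whose interval contains it.

```python
# Sketch of the encoder_uniform idea (illustrative): categories get
# sub-intervals of [0, 1) proportional to frequency; inverse_transform
# maps a value back to the owning category.
from collections import Counter
import random

random.seed(0)  # reproducible sampling for the example

class UniformEncoder:
    def fit(self, values):
        counts = Counter(values)
        total = sum(counts.values())
        self.bounds_ = {}
        lo = 0.0
        for cat, n in counts.items():
            hi = lo + n / total
            self.bounds_[cat] = (lo, hi)
            lo = hi

    def transform(self, values):
        # each value becomes a random draw from its category's interval
        return [random.uniform(*self.bounds_[v]) for v in values]

    def inverse_transform(self, values):
        return [
            next(c for c, (lo, hi) in self.bounds_.items() if lo <= x <= hi)
            for x in values
        ]

enc = UniformEncoder()
cats = ["a", "a", "a", "b"]
enc.fit(cats)
assert enc.inverse_transform(enc.transform(cats)) == cats
```

This style of encoding keeps categorical columns continuous, which suits synthesizers that model numeric distributions.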
### Scalers

- `scaler_standard`: Standardization (mean 0, std 1)
- `scaler_minmax`: Min-Max scaling (range [0, 1])
- `scaler_zerocenter`: Zero centering (mean 0)
- `scaler_log`: Logarithmic transformation
- `scaler_log1p`: log(1+x) transformation
- `scaler_timeanchor`: Time anchor scaling
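The round trip behind `scaler_standard` is a good example of why preprocessing and postprocessing must share one fitted instance: the learned mean and standard deviation are needed to restore the original units. A pure-Python sketch (illustrative):

```python
# Sketch of scaler_standard (illustrative): fit() learns mean/std,
# transform() standardizes, inverse_transform() restores original units.
import statistics

class StandardScaler:
    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def inverse_transform(self, values):
        return [v * self.std_ + self.mean_ for v in values]

scaler = StandardScaler()
data = [2.0, 4.0, 6.0, 8.0]
scaler.fit(data)
scaled = scaler.transform(data)
assert abs(statistics.fmean(scaled)) < 1e-9          # mean ~ 0
assert abs(statistics.pstdev(scaled) - 1.0) < 1e-9   # std ~ 1
restored = scaler.inverse_transform(scaled)
assert all(abs(a - b) < 1e-9 for a, b in zip(restored, data))
```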
### Discretization

- `discretizing_kbins`: K-bins discretization
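K-bins discretization is inherently lossy: the inverse can only restore a representative value per bin, not the exact originals. An equal-width sketch (illustrative; petsard's binning strategy may differ):

```python
# Sketch of K-bins discretization (illustrative): equal-width bins over the
# fitted range; inverse_transform restores each bin's midpoint, so the
# round trip is lossy by design.
class KBinsDiscretizer:
    def __init__(self, n_bins=4):
        self.n_bins = n_bins

    def fit(self, values):
        self.lo_ = min(values)
        self.width_ = (max(values) - self.lo_) / self.n_bins

    def transform(self, values):
        # clamp so the maximum value falls in the last bin
        return [
            min(int((v - self.lo_) / self.width_), self.n_bins - 1)
            for v in values
        ]

    def inverse_transform(self, bins):
        return [self.lo_ + (b + 0.5) * self.width_ for b in bins]

disc = KBinsDiscretizer(n_bins=4)
disc.fit([0.0, 10.0])                                  # bins of width 2.5
assert disc.transform([0.0, 2.6, 9.9, 10.0]) == [0, 1, 3, 3]
assert disc.inverse_transform([0, 3]) == [1.25, 8.75]  # bin midpoints
```

This lossiness is also why `discretizing` must come last in the sequence: no later step could meaningfully invert through it.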
## Notes

- **Recommended Practice**: Use YAML configuration files instead of direct Python API usage
- **Processing Order**:
  - Preprocessing: Must call `fit()` before `transform()`
  - Postprocessing: Must complete preprocessing before calling `inverse_transform()`
- **Sequence Constraints**:
  - `discretizing` and `encoder` cannot be used together
  - `discretizing` must be the last step in the sequence
  - A maximum of 4 processing steps is supported
- **Global Transformation**: Some processors (e.g., `outlier_isolationforest`, `outlier_lof`) apply to all fields
- **Instance Reuse**: Preprocessing and postprocessing should use the same Processor instance
- **Schema Usage**: Recommended to use Schema for defining data structure. See the Metadater API documentation for detailed settings
- **Documentation Note**: This documentation is for internal development team reference only and does not guarantee backward compatibility