fit()
Train the processor to learn statistical properties of the data.
Syntax
def fit(
    data: pd.DataFrame,
    sequence: list = None
) -> None
Parameters
data : pd.DataFrame, required
- Dataset used for training
- The processor learns statistical properties from this data
sequence : list, optional
- Custom processing sequence
- Default: ['missing', 'outlier', 'encoder', 'scaler']
- Available values: 'missing', 'outlier', 'encoder', 'scaler', 'discretizing'
Returns
None (method modifies instance state)
Description
The fit() method trains the processor. This method will:
- Analyze statistical properties of the data (mean, standard deviation, categories, etc.)
- Create transformation rules for each processor
- Prepare for subsequent transform() operations
This method must be called before transform().
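The fit-then-transform contract described above can be illustrated with a minimal, self-contained sketch. `ToyScaler` is a hypothetical stand-in and is not part of PETsARD. Raising `RuntimeError` when `transform()` is called before `fit()` is an assumption made for illustration, not documented behavior of the library.

```python
from statistics import mean, pstdev


class ToyScaler:
    """Hypothetical stand-in for a processor; not part of PETsARD."""

    def __init__(self):
        self._fitted = False  # training-complete flag
        self._mean = None
        self._std = None

    def fit(self, values):
        # Learn statistics from the training data and keep them on the instance.
        self._mean = mean(values)
        self._std = pstdev(values) or 1.0  # fall back to 1.0 for constant data
        self._fitted = True

    def transform(self, values):
        # Assumed behavior: refuse to transform until fit() has run.
        if not self._fitted:
            raise RuntimeError("call fit() before transform()")
        return [(v - self._mean) / self._std for v in values]


scaler = ToyScaler()
try:
    scaler.transform([1.0, 2.0])
except RuntimeError as err:
    print(err)

scaler.fit([1.0, 2.0, 3.0])
scaled = scaler.transform([1.0, 2.0, 3.0])

# Fitting again replaces the previously learned statistics.
scaler.fit([10.0, 20.0, 30.0])
```

The learned values live on the instance, which is why `fit()` returns nothing and why a second call discards the earlier result.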
Basic Example
from petsard import Loader, Processor
# Load data
loader = Loader('data.csv', schema='schema.yaml')
data, schema = loader.load()
# Create and train processor
processor = Processor(metadata=schema)
processor.fit(data)
# Transform data
processed_data = processor.transform(data)
Custom Processing Sequence
from petsard import Processor
# Use only missing value handling and encoding
processor = Processor(metadata=schema)
processor.fit(data, sequence=['missing', 'encoder'])
# Use the complete sequence (identical to the default)
processor = Processor(metadata=schema)
processor.fit(
    data,
    sequence=['missing', 'outlier', 'encoder', 'scaler']
)
Using Discretization
from petsard import Processor
# Use discretization (cannot be used with encoder simultaneously)
processor = Processor(metadata=schema)
processor.fit(
    data,
    sequence=['missing', 'outlier', 'discretizing']
)
Training Workflow
Start
↓
Validate the sequence
↓
Create Mediator for each step
↓
Train processors in sequence order:
- missing: Learn fill values (mean, median, etc.)
- outlier: Learn outlier thresholds
- encoder: Learn category mappings
- scaler: Learn scaling parameters (mean, std, etc.)
↓
Set training complete flag
↓
End
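As a rough illustration of the workflow above, the following sketch learns per-column statistics for each step in sequence order. It does not use PETsARD: `toy_fit`, the `learn_*` helpers, and the sample data are hypothetical, and the specific rules (mean fill, 1.5 * IQR thresholds, sorted integer codes) are assumptions chosen for simplicity. The Mediator step and `discretizing` are omitted, and every step learns from the raw column, whereas a real pipeline may train on the output of earlier steps.

```python
from statistics import mean, pstdev, quantiles

DEFAULT_SEQUENCE = ["missing", "outlier", "encoder", "scaler"]


def _observed(column):
    # Drop missing entries (represented here as None).
    return [v for v in column if v is not None]


def learn_missing(column):
    # Fill value: mean of the observed values.
    return {"fill": mean(_observed(column))}


def learn_outlier(column):
    # Thresholds from the 1.5 * IQR rule.
    q1, _, q3 = quantiles(_observed(column), n=4)
    iqr = q3 - q1
    return {"low": q1 - 1.5 * iqr, "high": q3 + 1.5 * iqr}


def learn_encoder(column):
    # Integer code per category, assigned in sorted order.
    categories = sorted(set(_observed(column)))
    return {"mapping": {cat: i for i, cat in enumerate(categories)}}


def learn_scaler(column):
    values = _observed(column)
    return {"mean": mean(values), "std": pstdev(values)}


LEARNERS = {
    "missing": learn_missing,
    "outlier": learn_outlier,
    "encoder": learn_encoder,
    "scaler": learn_scaler,
}
NUMERIC_STEPS = {"missing", "outlier", "scaler"}


def toy_fit(data, sequence=None):
    """Learn per-column statistics for each step, in sequence order."""
    sequence = DEFAULT_SEQUENCE if sequence is None else sequence
    unknown = [step for step in sequence if step not in LEARNERS]
    if unknown:
        raise ValueError(f"unknown steps: {unknown}")

    learned = {}
    for step in sequence:
        learned[step] = {}
        for name, column in data.items():
            numeric = all(isinstance(v, (int, float)) for v in _observed(column))
            # Numeric steps see numeric columns; the encoder sees the rest.
            if numeric == (step in NUMERIC_STEPS):
                learned[step][name] = LEARNERS[step](column)
    return learned


data = {
    "age": [20, 30, None, 40, 50],
    "city": ["Taipei", "Tainan", "Taipei", "Hsinchu", "Tainan"],
}
learned = toy_fit(data)
print(learned["missing"]["age"])
print(learned["encoder"]["city"])
```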
Notes
- Must call this method before transform()
- Training data should have the same structure as data to be transformed later
- discretizing and encoder cannot be used together
- discretizing must be the last step in the sequence
- Maximum of 4 processing steps supported
- Some processors (e.g., outlier_isolationforest) perform global transformation
- Statistical information learned during training is saved in the processor instance
- Calling fit() again will overwrite previous training results