fit()

Train the processor to learn statistical properties of the data.

Syntax

def fit(
    data: pd.DataFrame,
    sequence: list = None
) -> None

Parameters

data : pd.DataFrame, required
- Dataset used for training
- The processor learns its statistical properties from this data
sequence : list, optional
- Custom processing sequence (the ordered list of steps to train)
- Default: ['missing', 'outlier', 'encoder', 'scaler']
- Available values: 'missing', 'outlier', 'encoder', 'scaler', 'discretizing'

Returns

None (method modifies instance state)

Description

The fit() method trains the processor. This method will:

- Analyze statistical properties of the data (mean, standard deviation, categories, etc.)
- Create transformation rules for each processor
- Prepare for subsequent transform() operations

This method must be called before transform().
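
As a rough illustration of what "learning statistical properties" and the fit-before-transform rule mean, here is a minimal sketch. ToyScaler and its attributes are hypothetical names invented for this example; they are not part of PETsARD and do not reflect its internals.

```python
import statistics


class ToyScaler:
    """Hypothetical stand-in for one processing step; not PETsARD code."""

    def __init__(self):
        self.mean = None
        self.std = None
        self.is_fitted = False

    def fit(self, values):
        # Learn the statistics from the training values and store them
        # on the instance; like the documented fit(), this returns None.
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)
        self.is_fitted = True

    def transform(self, values):
        # Refuse to run before fit(), mirroring the ordering rule above.
        if not self.is_fitted:
            raise RuntimeError("call fit() before transform()")
        if self.std == 0:
            # Constant training data: map every value to 0.0.
            return [0.0 for _ in values]
        return [(v - self.mean) / self.std for v in values]


scaler = ToyScaler()
scaler.fit([1.0, 2.0, 3.0])
scaled = scaler.transform([1.0, 2.0, 3.0])
```

The statistics are computed once in fit() and reused by every later transform() call, which is why the training data should be representative of the data transformed afterwards.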

Basic Example

from petsard import Loader, Processor

# Load data
loader = Loader('data.csv', schema='schema.yaml')
data, schema = loader.load()

# Create and train processor
processor = Processor(metadata=schema)
processor.fit(data)

# Transform data
processed_data = processor.transform(data)

Custom Processing Sequence

from petsard import Processor

# `data` and `schema` are the objects loaded in the Basic Example above

# Use only missing value handling and encoding
processor = Processor(metadata=schema)
processor.fit(data, sequence=['missing', 'encoder'])

# Spell out the complete sequence (same as the default)
processor = Processor(metadata=schema)
processor.fit(
    data,
    sequence=['missing', 'outlier', 'encoder', 'scaler']
)

Using Discretization

from petsard import Processor

# `data` and `schema` are the objects loaded in the Basic Example above

# Use discretization: it cannot be combined with encoder and must come last
processor = Processor(metadata=schema)
processor.fit(
    data,
    sequence=['missing', 'outlier', 'discretizing']
)

Training Workflow

Start
  ↓
Validate the sequence
  ↓
Create Mediator for each step
  ↓
Train processors in sequence order:
  - missing: Learn fill values (mean, median, etc.)
  - outlier: Learn outlier thresholds
  - encoder: Learn category mappings
  - scaler: Learn scaling parameters (mean, std, etc.)
  ↓
Set training complete flag
  ↓
End
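
The workflow above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: ToyProcessor, LEARNERS, and the two learning functions are hypothetical, only two steps are modeled, the Mediator stage is omitted, and every step learns from the raw input values.

```python
import statistics


def learn_missing(values):
    # Learn a fill value: here, the median of the observed (non-None) values.
    observed = [v for v in values if v is not None]
    return {"fill_value": statistics.median(observed)}


def learn_scaler(values):
    # Learn scaling parameters: mean and population standard deviation.
    observed = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(observed),
        "std": statistics.pstdev(observed),
    }


# Hypothetical registry mapping a step name to its learning function.
LEARNERS = {"missing": learn_missing, "scaler": learn_scaler}


class ToyProcessor:
    """Hypothetical, single-column stand-in for the training workflow."""

    def __init__(self):
        self.learned = {}
        self.is_fitted = False

    def fit(self, values, sequence=None):
        steps = ["missing", "scaler"] if sequence is None else list(sequence)
        # Validate the sequence before any training happens.
        unknown = [s for s in steps if s not in LEARNERS]
        if unknown:
            raise ValueError(f"unknown steps: {unknown}")
        # Discard earlier results so that a refit overwrites them.
        self.learned = {}
        self.is_fitted = False
        # Train the steps in sequence order.
        for step in steps:
            self.learned[step] = LEARNERS[step](values)
        # Set the training-complete flag.
        self.is_fitted = True


processor = ToyProcessor()
processor.fit([1.0, None, 3.0, 5.0])
```

After the call, processor.learned holds one entry per trained step, in the order the steps were given.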

Notes

- Call fit() before transform()
- Training data should have the same structure as the data to be transformed later
- discretizing and encoder cannot be used together
- discretizing must be the last step in the sequence
- A sequence supports at most 4 processing steps
- Some processors (e.g., outlier_isolationforest) perform a global transformation
- Statistical information learned during training is stored in the processor instance
- Calling fit() again overwrites the previous training results
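
The sequence rules in these notes can be checked up front, before calling fit(). The helper below is a hypothetical sketch written for this page; validate_sequence, AVAILABLE_STEPS, and MAX_STEPS are assumed names, and the library's own validation may differ in wording and in the exception type it raises.

```python
# Step names taken from the "Available values" list in Parameters.
AVAILABLE_STEPS = ("missing", "outlier", "encoder", "scaler", "discretizing")
MAX_STEPS = 4


def validate_sequence(sequence):
    """Hypothetical helper: raise ValueError if a rule from the notes is broken."""
    if len(sequence) > MAX_STEPS:
        raise ValueError(f"at most {MAX_STEPS} steps are supported")
    unknown = [step for step in sequence if step not in AVAILABLE_STEPS]
    if unknown:
        raise ValueError(f"unknown steps: {unknown}")
    if "discretizing" in sequence:
        if "encoder" in sequence:
            raise ValueError("discretizing and encoder cannot be used together")
        if sequence[-1] != "discretizing":
            raise ValueError("discretizing must be the last step")


# Passes silently: discretizing is last and encoder is absent.
validate_sequence(["missing", "outlier", "discretizing"])
```

Running such a check first lets a misconfigured sequence fail with a clear message before any training starts.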

transform()