Scaling

Normalizes numerical data to a specific range or distribution to improve machine learning algorithm performance.

Usage Examples

Customizing Scaling for Specific Fields

---
Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema

Preprocessor:
  scaling-specific:
    sequence:
      - scaler
    config:
      scaler:
        age: 'scaler_minmax'          # Min-Max scaling
        fnlwgt: 'scaler_standard'     # Standardization
        educational-num: 'scaler_log' # Log transformation
        capital-loss: None            # No scaling for this field

Reporter:
  save_data:
    method: save_data
    source:
      - Preprocessor
  save_schema:
    method: save_schema
    source:
      - Loader
      - Preprocessor
...

Time Anchor Scaling

---
Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema

Preprocessor:
  time_scaling:
    sequence:
      - scaler
    config:
      scaler:
        created_at:
          method: 'scaler_timeanchor'
          reference: 'event_time'      # Reference time field (must exist and be datetime)
          unit: 'D'                    # Unit: days

Reporter:
  save_data:
    method: save_data
    source:
      - Preprocessor
  save_schema:
    method: save_schema
    source:
      - Loader
      - Preprocessor
...

Available Processors

| Processor | Description | Applicable Type | Output Range |
|---|---|---|---|
| scaler_standard | Standardization | Numerical | Mean 0, Std 1 |
| scaler_minmax | Min-Max scaling | Numerical | [0, 1] |
| scaler_zerocenter | Zero centering | Numerical | Mean 0 |
| scaler_log | Log transformation | Positive values | log(x) |
| scaler_log1p | log(1+x) transformation | Non-negative values | log(1+x) |
| scaler_timeanchor | Time anchor scaling | Datetime | Time difference |

Processor Details

scaler_standard

Standardization: Transforms to a distribution with mean 0 and standard deviation 1.

Formula:

x' = (x - μ) / σ

Features:

  • Preserves data distribution shape
  • Eliminates scale effects
  • Suitable for most machine learning algorithms
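
The formula above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `standardize` is hypothetical, and it assumes the population standard deviation, which may differ from what scaler_standard actually uses.

```python
import statistics

def standardize(values):
    """Apply x' = (x - mean) / std and return the scaled values with the parameters."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    scaled = [(x - mean) / std for x in values]
    return scaled, mean, std

ages = [20.0, 30.0, 40.0, 50.0, 60.0]
scaled_ages, age_mean, age_std = standardize(ages)
# scaled_ages now has mean 0 and standard deviation 1 (up to rounding)
```

The relative spacing between values is unchanged, which is why the shape of the distribution is preserved.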

scaler_minmax

Min-Max Scaling: Linear scaling to [0, 1] range.

Formula:

x' = (x - min) / (max - min)

Features:

  • Preserves data distribution shape
  • Fixed output range
  • Sensitive to outliers
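
A small sketch of the formula, mainly to show the outlier sensitivity listed above. The function name `minmax_scale` and the sample values are illustrative assumptions, not part of the library.

```python
def minmax_scale(values):
    """Apply x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; range is zero")
    return [(x - lo) / (hi - lo) for x in values]

hours = [10.0, 20.0, 30.0, 40.0]
scaled_hours = minmax_scale(hours)               # spans the full [0, 1] range

# A single extreme value stretches the range and squeezes the rest near 0.
with_outlier = minmax_scale(hours + [1000.0])
```

With the outlier present, the four original values all land below 0.05, which is why handling outliers before Min-Max scaling is usually worthwhile.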

scaler_zerocenter

Zero Centering: Adjusts mean to 0 while preserving original standard deviation.

Formula:

x' = x - μ

Features:

  • Only adjusts position, not scale
  • Preserves original data variance
  • Suitable when original scale needs to be maintained
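
As a brief illustration (the function name `zero_center` is hypothetical), the sketch below shifts values by their mean and checks that the spread is unchanged.

```python
import statistics

def zero_center(values):
    """Apply x' = x - mean and return the centered values with the mean."""
    center = statistics.fmean(values)
    return [x - center for x in values], center

values = [2.0, 4.0, 6.0, 8.0]
centered, center = zero_center(values)   # centered == [-3.0, -1.0, 1.0, 3.0]

# The standard deviation is the same before and after centering.
std_before = statistics.pstdev(values)
std_after = statistics.pstdev(centered)
```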

scaler_log

Log Transformation: Applies logarithmic transformation to values.

Formula:

x' = log(x)

Features:

  • Only applicable to positive numbers
  • Compresses large values, expands small values
  • Suitable for handling skewed distributions

Note: Data must be strictly positive. Zero or negative values cannot be log-transformed and will produce errors or NaN.
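
The following sketch shows the transformation and its inverse, with an explicit check for non-positive input. It assumes the natural logarithm; the function names `log_scale` and `log_inverse` are illustrative, and the real processor may report invalid input differently.

```python
import math

def log_scale(values):
    """Apply x' = log(x); reject zero or negative input up front."""
    bad = [x for x in values if x <= 0]
    if bad:
        raise ValueError(f"log transform needs positive values, got {bad}")
    return [math.log(x) for x in values]

def log_inverse(scaled):
    """Undo the transform with the exponential function."""
    return [math.exp(y) for y in scaled]

incomes = [1_000.0, 10_000.0, 1_000_000.0]
log_incomes = log_scale(incomes)
restored_incomes = log_inverse(log_incomes)
```

A thousandfold spread in the input becomes a difference of about 6.9 after the transform, which is the compression of large values described above.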

scaler_log1p

log(1+x) Transformation: A variant of log transformation suitable for data containing zeros.

Formula:

x' = log(1 + x)

Features:

  • Applicable to non-negative numbers (including 0)
  • Better numerical stability
  • Inverse transformation: exp(x') - 1
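
A minimal sketch using Python's standard `math.log1p` and `math.expm1`, which compute log(1 + x) and exp(x) - 1 directly. The wrapper names `log1p_scale` and `log1p_inverse` are illustrative assumptions.

```python
import math

def log1p_scale(values):
    """Apply x' = log(1 + x) to non-negative values."""
    if any(x < 0 for x in values):
        raise ValueError("log1p sketch expects non-negative values")
    return [math.log1p(x) for x in values]

def log1p_inverse(scaled):
    """Undo the transform with exp(x') - 1."""
    return [math.expm1(y) for y in scaled]

counts = [0.0, 1.0, 10.0, 1000.0]
scaled_counts = log1p_scale(counts)        # zero maps to zero
restored_counts = log1p_inverse(scaled_counts)

# Stability for very small inputs: 1 + tiny rounds to 1.0 in floating point.
tiny = 1e-20
naive = math.log(1 + tiny)    # loses the value entirely
stable = math.log1p(tiny)     # keeps it
```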

scaler_timeanchor

Time Anchor Scaling: Calculates time difference from a reference time.

Parameters:

  • reference (str or list[str], required): Reference time field name(s)
    • Single reference (str): Transforms anchor field to time difference from reference field
    • Multiple references (list[str]): Keeps anchor as datetime, transforms all reference fields to time differences from anchor
  • unit (str, optional): Time difference unit
    • 'D': Days (default)
    • 'S': Seconds

Features:

  • Converts absolute time to relative time
  • Supports one-to-one or one-to-many time relationships
  • Suitable for multi-timepoint data (e.g., company establishment date vs. multiple application/approval dates)

Usage Patterns:

  1. Single Reference Mode (one reference field)
scaler:
  created_at:
    method: 'scaler_timeanchor'
    reference: 'event_time'  # Single reference field
    unit: 'D'

Result: created_at becomes a numerical day difference from event_time; event_time remains a datetime.

  2. Multiple Reference Mode (multiple reference fields)
scaler:
  established_date:
    method: 'scaler_timeanchor'
    reference:  # Multiple reference fields (list)
      - 'first_apply_date'
      - 'approval_date'
      - 'tracking_date'
    unit: 'D'

Result: established_date remains as datetime (anchor), three reference fields are transformed to day differences from anchor (numerical)
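
Both modes can be sketched with the standard `datetime` module. This is an illustration under assumptions: the helper names `anchor_diff` and `anchor_restore` are hypothetical, the sample dates are made up, and the sign convention (later time minus earlier anchor or reference) is a guess that the actual processor may not share.

```python
from datetime import datetime, timedelta

SECONDS_PER_UNIT = {"D": 86_400, "S": 1}  # the two units listed above

def anchor_diff(target, base, unit="D"):
    """Return target - base expressed in the given unit."""
    return (target - base).total_seconds() / SECONDS_PER_UNIT[unit]

def anchor_restore(diff, base, unit="D"):
    """Inverse step: add the difference back to the base time."""
    return base + timedelta(seconds=diff * SECONDS_PER_UNIT[unit])

# Single reference mode: created_at becomes a difference from event_time.
row = {"created_at": datetime(2024, 3, 11, 12, 0), "event_time": datetime(2024, 3, 1)}
days = anchor_diff(row["created_at"], row["event_time"])             # 10.5 days
seconds = anchor_diff(row["created_at"], row["event_time"], unit="S")

# Multiple reference mode: the anchor stays a datetime, references become differences.
company = {
    "established_date": datetime(2020, 1, 1),
    "first_apply_date": datetime(2020, 1, 31),
    "approval_date": datetime(2020, 3, 1),
}
diffs = {
    name: anchor_diff(company[name], company["established_date"])
    for name in ("first_apply_date", "approval_date")
}
```

Because the base time is kept as a datetime, `anchor_restore` can recover the original absolute time from the numerical difference.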

Processing Logic

Statistical Scaling (Standard/MinMax/ZeroCenter)

  • Training phase (fit): Calculate statistical parameters (mean, standard deviation, min, max)
  • Transform phase (transform): Scale data using statistical parameters
  • Inverse transform phase (inverse_transform): Unscale using statistical parameters
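
The three phases can be sketched as a small class. `MinMaxSketch` is a hypothetical name used only for illustration; it mirrors the fit / transform / inverse_transform flow described above and is not the library's class.

```python
class MinMaxSketch:
    """Illustrative Min-Max scaler that stores its parameters at fit time."""

    def fit(self, values):
        # Training phase: record the statistics needed later.
        self.min_ = min(values)
        self.max_ = max(values)
        if self.max_ == self.min_:
            raise ValueError("all values are equal; range is zero")
        return self

    def transform(self, values):
        # Transform phase: scale using the stored parameters.
        span = self.max_ - self.min_
        return [(x - self.min_) / span for x in values]

    def inverse_transform(self, scaled):
        # Inverse phase: map scaled values back to the original units.
        span = self.max_ - self.min_
        return [y * span + self.min_ for y in scaled]

train = [18.0, 30.0, 42.0, 90.0]
scaler = MinMaxSketch().fit(train)
scaled = scaler.transform(train)
restored = scaler.inverse_transform(scaled)

# Values outside the fitted range map outside [0, 1].
unseen = scaler.transform([96.0])
```

Because the parameters are fixed at fit time, data that falls outside the training range is scaled outside [0, 1] rather than clipped in this sketch.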

Log Transformation (Log/Log1p)

  • Training phase (fit): No training needed
  • Transform phase (transform): Apply logarithmic function
  • Inverse transform phase (inverse_transform): Apply exponential function

Time Anchor (TimeAnchor)

  • Training phase (fit): Record reference field
  • Transform phase (transform): Calculate difference from reference time
  • Inverse transform phase (inverse_transform): Add back reference time to restore absolute time

Default Behavior

Default scaling for different data types:

| Data Type | Default Processor | Description |
|---|---|---|
| Numerical | scaler_standard | Standardization |
| Categorical | None | No scaling |
| Datetime | scaler_standard | Standardization (timestamp) |

Scaling Method Comparison

| Method | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| Standard | Highly versatile; preserves distribution | No fixed range | Most situations; neural networks |
| MinMax | Fixed range; easy to understand | Sensitive to outliers | Fixed range needed; image processing |
| ZeroCenter | Preserves scale; simple | Doesn’t change scale | Need to preserve original scale |
| Log | Handles skewness; compresses large values | Only for positive numbers | Income, population; right-skewed distribution |
| Log1p | Allows zero values; stable | Slight compression | Count data; non-negative numbers |
| TimeAnchor | Relative time; easy to process | Requires reference field | Time series; event times |

Notes

  • Processing Order: Scaling is usually the last preprocessing step (after encoding)
  • Log Limitation: scaler_log requires strictly positive input; zero or negative values produce errors or NaN
  • Outlier Impact: MinMax is sensitive to outliers, so handle outliers before scaling
  • Reference Field: TimeAnchor’s reference field must exist and be datetime type
  • Restoration Accuracy: All scaling methods can be precisely restored (within numerical precision)
  • Synthetic Data: Scaled values of synthetic data may slightly exceed training data range
  • With discretizing: If the discretizing step is used, scaling is typically not needed