Scaling

Normalizes numerical data to a specific range or distribution to improve machine learning algorithm performance.

Usage Examples

Using Default Scaling

Preprocessor:
  demo:
    method: 'default'
    # Numerical fields: Use standardization
    # Categorical fields: No scaling

Customizing Scaling for Specific Fields

Preprocessor:
  custom:
    method: 'default'
    config:
      scaler:
        age: 'scaler_minmax'           # Min-Max scaling
        income: 'scaler_standard'      # Standardization
        hours_per_week: 'scaler_log'   # Log transformation
        gender: None                   # No scaling for categorical field

Time Anchor Scaling

Preprocessor:
  time_scaling:
    method: 'default'
    config:
      scaler:
        created_at:
          method: 'scaler_timeanchor'
          reference: 'event_time'      # Reference time field
          unit: 'D'                    # Unit: days

Available Processors

| Processor | Description | Applicable Type | Output Range |
|---|---|---|---|
| scaler_standard | Standardization | Numerical | Mean 0, Std 1 |
| scaler_minmax | Min-Max scaling | Numerical | [0, 1] |
| scaler_zerocenter | Zero centering | Numerical | Mean 0 |
| scaler_log | Log transformation | Positive values | log(x) |
| scaler_log1p | log(1+x) transformation | Non-negative values | log(1+x) |
| scaler_timeanchor | Time anchor scaling | Datetime | Time difference |

Processor Details

scaler_standard

Standardization: Transforms to a distribution with mean 0 and standard deviation 1.

Formula:

x' = (x - μ) / σ

Features:

  • Preserves data distribution shape
  • Eliminates scale effects
  • Suitable for most machine learning algorithms

Example:

config:
  scaler:
    income: 'scaler_standard'
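
The formula above can be checked by hand with a few lines of Python. This is only an illustrative sketch, not the library's implementation: the `standardize` helper is hypothetical, and it assumes the population standard deviation is used for σ.

```python
import statistics

def standardize(values):
    """Apply x' = (x - mean) / std to every value (hypothetical helper)."""
    mu = statistics.fmean(values)
    # Assumption: population standard deviation; the library may differ.
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
scaled = standardize(incomes)
# roughly [-1.41, -0.71, 0.0, 0.71, 1.41]: mean 0, standard deviation 1
print([round(v, 2) for v in scaled])
```

A constant column has σ = 0, so this sketch would divide by zero; a real implementation needs to handle that case.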

scaler_minmax

Min-Max Scaling: Linear scaling to [0, 1] range.

Formula:

x' = (x - min) / (max - min)

Features:

  • Preserves data distribution shape
  • Fixed output range
  • Sensitive to outliers

Example:

config:
  scaler:
    age: 'scaler_minmax'
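
To make the outlier sensitivity concrete, the sketch below applies the formula to a small list and then to the same list with one extreme value added. The `minmax_scale` helper is hypothetical and only mirrors the formula above.

```python
def minmax_scale(values):
    """Apply x' = (x - min) / (max - min) to every value (hypothetical helper)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

ages = [18, 30, 42, 66]
print(minmax_scale(ages))  # [0.0, 0.25, 0.5, 1.0]

# One extreme value stretches the range, so the four original ages
# are squeezed into roughly the lowest tenth of [0, 1].
ages_with_outlier = ages + [498]
skewed = minmax_scale(ages_with_outlier)
print([round(v, 3) for v in skewed])
```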

scaler_zerocenter

Zero Centering: Adjusts mean to 0 while preserving original standard deviation.

Formula:

x' = x - μ

Features:

  • Only adjusts position, not scale
  • Preserves original data variance
  • Suitable when original scale needs to be maintained

Example:

config:
  scaler:
    temperature: 'scaler_zerocenter'
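
As a quick check that zero centering shifts the data without changing its spread, the following sketch compares the standard deviation before and after. The `zero_center` helper is hypothetical.

```python
import statistics

def zero_center(values):
    """Apply x' = x - mean to every value (hypothetical helper)."""
    mu = statistics.fmean(values)
    return [x - mu for x in values]

temperatures = [18.0, 21.0, 24.0, 27.0]
centered = zero_center(temperatures)
print(centered)  # [-4.5, -1.5, 1.5, 4.5]

# The spread is unchanged; only the location moved.
print(statistics.pstdev(temperatures), statistics.pstdev(centered))
```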

scaler_log

Log Transformation: Applies logarithmic transformation to values.

Formula:

x' = log(x)

Features:

  • Only applicable to positive numbers
  • Compresses large values, expands small values
  • Suitable for handling skewed distributions

Example:

config:
  scaler:
    salary: 'scaler_log'

Note: Values must be strictly positive; zero or negative values produce invalid results (errors or NaN).
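
Because non-positive input is invalid, it can help to validate a column before choosing scaler_log. The sketch below is a hypothetical pre-check plus round trip. It assumes the natural logarithm, and `log_scale` / `log_inverse` are illustrative names, not part of the library.

```python
import math

def log_scale(values):
    """Apply x' = log(x), rejecting non-positive input up front (hypothetical helper)."""
    bad = [x for x in values if x <= 0]
    if bad:
        raise ValueError(f"log scaling needs positive values, got {bad[:3]}")
    # Assumption: natural logarithm.
    return [math.log(x) for x in values]

def log_inverse(scaled):
    """Undo the transform with exp."""
    return [math.exp(y) for y in scaled]

salaries = [20_000, 50_000, 1_000_000]
scaled = log_scale(salaries)
restored = log_inverse(scaled)

# The largest salary is 50x the smallest, but the gap between their logs is under 4.
print(round(scaled[-1] - scaled[0], 2))
```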

scaler_log1p

log(1+x) Transformation: A variant of log transformation suitable for data containing zeros.

Formula:

x' = log(1 + x)

Features:

  • Applicable to non-negative numbers (including 0)
  • Better numerical stability for values near zero
  • Uses exp(x') - 1 for inverse transformation

Example:

config:
  scaler:
    count: 'scaler_log1p'
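
The transform and its inverse map directly onto `math.log1p` and `math.expm1` from the Python standard library. The round trip below is a sketch of the formula, not the library's own code.

```python
import math

counts = [0, 1, 9, 99]

# Forward: x' = log(1 + x); zero maps to zero instead of failing.
scaled = [math.log1p(x) for x in counts]

# Inverse: x = exp(x') - 1
restored = [math.expm1(y) for y in scaled]

print(scaled[0])                       # 0.0
print([round(v, 6) for v in restored])
```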

scaler_timeanchor

Time Anchor Scaling: Calculates time difference from a reference time.

Parameters:

  • reference (str, required): Reference time field name
  • unit (str, optional): Time difference unit
    • 'D': Days (default)
    • 'S': Seconds

Features:

  • Converts absolute time to relative time
  • Suitable for time series data
  • Requires another date field as reference point

Examples:

config:
  scaler:
    created_at:
      method: 'scaler_timeanchor'
      reference: 'event_time'
      unit: 'D'
    
    update_time:
      method: 'scaler_timeanchor'
      reference: 'created_at'
      unit: 'S'
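
The idea behind the `reference` and `unit` parameters can be sketched with the standard `datetime` module. Two points here are assumptions: the difference is taken as value minus reference, and fractional units are kept. The helper names are hypothetical.

```python
from datetime import datetime, timedelta

# Seconds per unit for the two documented units.
UNIT_SECONDS = {"D": 86_400, "S": 1}

def time_anchor(value, reference, unit="D"):
    """Difference between value and reference, in the given unit.

    Assumption: computed as value - reference, keeping fractions.
    """
    return (value - reference).total_seconds() / UNIT_SECONDS[unit]

def time_anchor_inverse(diff, reference, unit="D"):
    """Add the difference back to the reference to recover the absolute time."""
    return reference + timedelta(seconds=diff * UNIT_SECONDS[unit])

event_time = datetime(2024, 1, 1)
created_at = datetime(2024, 1, 11, 12)

days = time_anchor(created_at, event_time, "D")     # 10.5
seconds = time_anchor(created_at, event_time, "S")  # 907200.0
restored = time_anchor_inverse(days, event_time, "D")
```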

Processing Logic

Statistical Scaling (Standard/MinMax/ZeroCenter)

Training phase (fit):
  Calculate statistical parameters (mean, standard deviation, min, max)

Transform phase (transform):
  Scale data using statistical parameters

Inverse transform phase (inverse_transform):
  Unscale using statistical parameters
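
The three phases can be pictured as a small class that stores its statistics during fit and reuses them afterwards. This is a minimal sketch for standardization only; the class name and the population standard deviation are assumptions, not the library's actual code.

```python
import statistics

class StandardScalerSketch:
    """Hypothetical illustration of the fit / transform / inverse_transform phases."""

    def fit(self, values):
        # Training phase: compute and store the statistical parameters.
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)  # assumption: population std
        return self

    def transform(self, values):
        # Transform phase: scale with the stored parameters.
        return [(x - self.mean_) / self.std_ for x in values]

    def inverse_transform(self, scaled):
        # Inverse phase: undo the scaling with the same parameters.
        return [y * self.std_ + self.mean_ for y in scaled]

train = [10.0, 20.0, 30.0, 40.0]
scaler = StandardScalerSketch().fit(train)
scaled = scaler.transform(train)
restored = scaler.inverse_transform(scaled)
```

The parameters come from the training data only, so data transformed later is scaled consistently with it.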

Log Transformation (Log/Log1p)

Training phase (fit):
  No training needed

Transform phase (transform):
  Apply logarithmic function

Inverse transform phase (inverse_transform):
  Apply exponential function

Time Anchor (TimeAnchor)

Training phase (fit):
  Record reference field

Transform phase (transform):
  Calculate difference from reference time

Inverse transform phase (inverse_transform):
  Add back reference time to restore absolute time

Default Behavior

Default scaling for different data types:

| Data Type | Default Processor | Description |
|---|---|---|
| Numerical | scaler_standard | Standardization |
| Categorical | None | No scaling |
| Datetime | scaler_standard | Standardization (timestamp) |

Scaling Method Comparison

| Method | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| Standard | Highly versatile; preserves distribution | No fixed range | Most situations; neural networks |
| MinMax | Fixed range; easy to understand | Sensitive to outliers | Fixed range needed; image processing |
| ZeroCenter | Preserves scale; simple | Doesn’t change scale | Need to preserve original scale |
| Log | Handles skewness; compresses large values | Only for positive numbers | Income, population; right-skewed distribution |
| Log1p | Allows zero values; stable | Slight compression | Count data; non-negative numbers |
| TimeAnchor | Relative time; easy to process | Requires reference field | Time series; event times |

Complete Example

Loader:
  load_data:
    filepath: 'data.csv'
    schema: 'schema.yaml'

Preprocessor:
  scale_data:
    method: 'default'
    sequence:
      - missing
      - outlier
      - encoder
      - scaler
    config:
      scaler:
        # Numerical fields using different scaling methods
        age: 'scaler_minmax'              # Age: [0, 1] range
        income: 'scaler_log1p'            # Income: log(1+x) transformation
        hours_per_week: 'scaler_standard' # Hours: Standardization
        
        # Time fields
        created_at:
          method: 'scaler_timeanchor'
          reference: 'birth_date'
          unit: 'D'
        
        # No scaling for categorical fields
        gender: None
        education: None

Notes

  • Processing Order: Scaling is usually the last preprocessing step, after encoding.
  • Log Limitation: scaler_log accepts only positive numbers; zero or negative values produce NaN or errors.
  • Outlier Impact: scaler_minmax is sensitive to outliers, so handle outliers before scaling.
  • Reference Field: The reference field of scaler_timeanchor must exist and be of datetime type.
  • Restoration Accuracy: Every scaling method can be inverted exactly, up to floating-point precision.
  • Synthetic Data: Scaled values in synthetic data may fall slightly outside the range seen in training data.
  • With discretizing: When discretizing is used, a scaler is typically not needed.
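
One way to read the synthetic-data note: a synthesizer works in the scaled space and may emit values just below 0 or above 1 for a Min-Max scaled field, so the restored values land just outside the original range. The sketch below illustrates that reading with made-up numbers. The helper names, the sample values, and the optional clipping step are all assumptions, not documented library behavior.

```python
def fit_minmax(train):
    """Return the (min, max) learned from training data (hypothetical helper)."""
    return min(train), max(train)

def inverse_minmax(scaled, lo, hi):
    """Undo x' = (x - min) / (max - min)."""
    return [lo + y * (hi - lo) for y in scaled]

train_ages = [20, 30, 40, 60]
lo, hi = fit_minmax(train_ages)

# Made-up synthetic values in scaled space, slightly outside [0, 1].
synthetic_scaled = [-0.02, 0.5, 1.03]
restored = inverse_minmax(synthetic_scaled, lo, hi)
print([round(v, 1) for v in restored])  # roughly [19.2, 40.0, 61.2]

# Optional post-processing (an assumption): clip back to the training range.
clipped = [min(max(v, lo), hi) for v in restored]
```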

Related Documentation