Discretizing

Converts continuous numerical data into discrete categories or intervals. The discretizing step is mutually exclusive with the encoder step: a sequence can contain one or the other, but not both.

Usage Examples

Basic Usage

Preprocessor:
  demo:
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing  # Use discretizing, not encoder

Custom K-bins Parameters

Preprocessor:
  custom:
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing
    config:
      discretizing:
        age:
          method: 'discretizing_kbins'
          n_bins: 10                    # Divide into 10 bins
        income:
          method: 'discretizing_kbins'
          n_bins: 5                     # Divide into 5 bins

Available Processors

| Processor | Description | Applicable Type | Output |
|---|---|---|---|
| discretizing_kbins | K-bins discretization | Numerical | Integer labels |
| encoder_label | Label encoding | Categorical | Integer labels |

Note: In the discretizing sequence, categorical data will automatically use encoder_label for encoding.

Processor Details

discretizing_kbins

K-bins Discretization: Divides continuous values into k equal-width intervals.

Parameters:

  • n_bins (int, optional)
    • Number of bins (k value)
    • Default value: 5
    • Example: n_bins: 10

Features:

  • Equal-width binning
  • Outputs integer labels (0, 1, 2, …)
  • Reduces the number of distinct values and the overall complexity of the data

Example:

config:
  discretizing:
    age:
      method: 'discretizing_kbins'
      n_bins: 5

Discretization Example:

Original values: [18, 25, 35, 45, 55, 65]
n_bins = 5

Bin division:
[18, 27.4) → 0
[27.4, 36.8) → 1
[36.8, 46.2) → 2
[46.2, 55.6) → 3
[55.6, 65] → 4

Discretized result: [0, 0, 1, 2, 3, 4]
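
The bin boundaries above follow from the equal-width rule: each bin spans (max - min) / n_bins, here (65 - 18) / 5 = 9.4. The following is a minimal pure-Python sketch of that arithmetic, not the library's implementation. The function names `equal_width_edges` and `assign_bins` are hypothetical, and placing the maximum value in the last bin is an assumption taken from the closed final interval shown above.

```python
def equal_width_edges(values, n_bins=5):
    """Return the n_bins + 1 boundaries of equal-width bins over values."""
    lo, hi = float(min(values)), float(max(values))
    if lo == hi:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / n_bins
    # The final edge is set to hi directly to avoid floating-point drift.
    return [lo + i * width for i in range(n_bins)] + [hi]


def assign_bins(values, n_bins=5):
    """Map each value to an integer label in 0 .. n_bins - 1."""
    lo, hi = float(min(values)), float(max(values))
    if lo == hi:
        raise ValueError("need at least two distinct values to form bins")
    width = (hi - lo) / n_bins
    # The maximum lies exactly on the last edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 35, 45, 55, 65]
edges = [round(e, 1) for e in equal_width_edges(ages)]
labels = assign_bins(ages)
print(edges)   # [18.0, 27.4, 36.8, 46.2, 55.6, 65.0]
print(labels)  # [0, 0, 1, 2, 3, 4]
```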

Processing Logic

Numerical Data (discretizing_kbins)

Training phase (fit):
  Calculate and store bin boundaries

Transform phase (transform):
  Map values to integer labels based on bin boundaries

Inverse transform phase (inverse_transform):
  Restore integer labels to bin midpoint values
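
The three phases can be sketched as a small class. This is an illustrative stand-in under stated assumptions, not the actual processor: the class name `KBinsSketch` is hypothetical, bins are assumed left-closed with the maximum placed in the last bin, and values outside the fitted range are assumed to be clamped to the first or last bin.

```python
from bisect import bisect_right


class KBinsSketch:
    """Hypothetical equal-width discretizer with fit/transform/inverse_transform."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins
        self.edges = None

    def fit(self, values):
        # Training phase: calculate and store the bin boundaries.
        lo, hi = float(min(values)), float(max(values))
        if lo == hi:
            raise ValueError("need at least two distinct values to form bins")
        width = (hi - lo) / self.n_bins
        self.edges = [lo + i * width for i in range(self.n_bins)] + [hi]
        return self

    def transform(self, values):
        # Transform phase: locate each value among the stored boundaries.
        labels = []
        for v in values:
            label = bisect_right(self.edges, v) - 1
            # Assumption: clamp the maximum and out-of-range values.
            labels.append(max(0, min(label, self.n_bins - 1)))
        return labels

    def inverse_transform(self, labels):
        # Inverse phase: replace each label with the midpoint of its bin.
        return [(self.edges[k] + self.edges[k + 1]) / 2 for k in labels]


ages = [18, 25, 35, 45, 55, 65]
kbins = KBinsSketch(n_bins=5).fit(ages)
labels = kbins.transform(ages)
restored = kbins.inverse_transform(labels)
print(labels)                            # [0, 0, 1, 2, 3, 4]
print([round(r, 1) for r in restored])   # [22.7, 22.7, 32.1, 41.5, 50.9, 60.3]
```

With equal-width bins and midpoint restoration, a restored value differs from the original by at most half a bin width, which is 4.7 in this example.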

Categorical Data (encoder_label)

Training phase (fit):
  Create mapping from categories to integers

Transform phase (transform):
  Convert categories to integers based on mapping

Inverse transform phase (inverse_transform):
  Restore integers to categories based on mapping
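
A comparable sketch for the categorical path is shown below. The class name `LabelEncoderSketch` is hypothetical, and assigning integers in sorted category order is an assumption made for illustration; the actual processor may order categories differently.

```python
class LabelEncoderSketch:
    """Hypothetical label encoder with fit/transform/inverse_transform."""

    def fit(self, values):
        # Training phase: build the category -> integer mapping.
        # Assumption: integers follow the sorted order of the categories.
        self.categories = sorted(set(values))
        self.mapping = {cat: i for i, cat in enumerate(self.categories)}
        return self

    def transform(self, values):
        # Transform phase: look up each category in the mapping.
        return [self.mapping[v] for v in values]

    def inverse_transform(self, labels):
        # Inverse phase: index back into the stored category list.
        return [self.categories[i] for i in labels]


education = ["HS", "BSc", "MSc", "BSc"]
encoder = LabelEncoderSketch().fit(education)
codes = encoder.transform(education)
print(codes)                              # [1, 0, 2, 0]
print(encoder.inverse_transform(codes))   # ['HS', 'BSc', 'MSc', 'BSc']
```

Unlike the numerical case, this round trip is lossless: every integer maps back to exactly one category.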

Default Behavior

Default processing when using discretizing sequence:

| Data Type | Default Processor | Description |
|---|---|---|
| Numerical | discretizing_kbins | K-bins discretization (k=5) |
| Categorical | encoder_label | Label encoding |
| Datetime | discretizing_kbins | K-bins discretization (k=5) |

Differences from encoder

| Feature | discretizing | encoder |
|---|---|---|
| Numerical Output | Discrete integers (0, 1, 2, …) | Continuous values or multiple columns |
| Use Case | Discretization needs | General encoding needs |
| Categorical Handling | Label encoding | Multiple encodings (Uniform/Label/OneHot) |
| With scaler | Usually not used | Usually used together |
| Sequence Position | Must be last step | Before scaler |

Usage Restrictions

1. Mutually Exclusive with encoder

# ❌ Wrong: Cannot use both
Preprocessor:
  wrong:
    method: 'default'
    sequence:
      - missing
      - encoder       # Error!
      - discretizing  # Mutually exclusive with encoder

# ✅ Correct: Use only one
Preprocessor:
  correct:
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing  # Correct

2. Must Be Last Step

# ❌ Wrong: Steps after discretizing
Preprocessor:
  wrong:
    method: 'default'
    sequence:
      - missing
      - discretizing
      - scaler        # Error! discretizing must be last

# ✅ Correct: discretizing is the last step
Preprocessor:
  correct:
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing  # Correct: last step
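
Both restrictions can be checked before a run. The helper below is a hypothetical pre-flight check written for illustration; `validate_sequence` is not part of the documented API, and the error messages are invented.

```python
def validate_sequence(sequence):
    """Raise ValueError if a sequence breaks the two discretizing rules."""
    if "discretizing" not in sequence:
        return  # The rules only apply when discretizing is used.
    if "encoder" in sequence:
        raise ValueError("discretizing and encoder are mutually exclusive")
    if sequence[-1] != "discretizing":
        raise ValueError("discretizing must be the last step in the sequence")


# Valid: discretizing is last and encoder is absent.
validate_sequence(["missing", "outlier", "discretizing"])

# Invalid sequences from the examples above.
for bad in (["missing", "encoder", "discretizing"],
            ["missing", "discretizing", "scaler"]):
    try:
        validate_sequence(bad)
    except ValueError as err:
        print(f"{bad}: {err}")
```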

Complete Example

Loader:
  load_data:
    filepath: 'data.csv'
    schema: 'schema.yaml'

Preprocessor:
  discretize_data:
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing  # Note: no encoder or scaler in this sequence
    config:
      # Missing value handling
      missing:
        age: 'missing_median'
        income: 'missing_mean'
      
      # Outlier handling
      outlier:
        age: 'outlier_iqr'
        income: 'outlier_iqr'
      
      # Discretizing configuration
      discretizing:
        # Numerical fields
        age:
          method: 'discretizing_kbins'
          n_bins: 10                    # Divide age into 10 bins
        income:
          method: 'discretizing_kbins'
          n_bins: 5                     # Divide income into 5 bins
        hours_per_week:
          method: 'discretizing_kbins'
          n_bins: 8                     # Divide hours into 8 bins
        
        # Categorical fields (automatically uses encoder_label)
        gender: 'encoder_label'
        education: 'encoder_label'

Synthesizer:
  synthesize:
    method: 'default'

Postprocessor:
  postprocess:
    method: 'default'

Use Cases

1. Simplify Data Distribution

# Simplify continuous distribution into discrete intervals
Preprocessor:
  simplify:
    method: 'default'
    sequence:
      - missing
      - discretizing
    config:
      discretizing:
        salary:
          method: 'discretizing_kbins'
          n_bins: 3  # Low/Medium/High three levels

2. Reduce Data Granularity

# Reduce numerical precision, decrease synthesis difficulty
Preprocessor:
  reduce_dimension:
    method: 'default'
    sequence:
      - missing
      - outlier
      - discretizing
    config:
      discretizing:
        age:
          method: 'discretizing_kbins'
          n_bins: 5
        score:
          method: 'discretizing_kbins'
          n_bins: 10

Notes

  • Mutual Exclusivity: Cannot be used with encoder simultaneously
  • Position Restriction: Must be the last step in the sequence
  • Restoration Precision: The inverse transform returns bin midpoints, so original values are not recovered exactly
  • Number of Bins: Choose n_bins based on the data range and the precision you need; more bins preserve more detail but produce more distinct labels
  • Synthesizer Impact: Some synthesizers (like PAC-Synth, DPCTGAN) may output floating-point values for the integer labels; the system rounds them automatically
  • Use Case: Suitable for situations requiring discretized output, such as certain privacy protection algorithms
  • NA Handling: During post-processing, NA values will be removed before restoration
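
The rounding and NA notes above can be pictured with a short sketch of the restoration step. This is an assumption-laden illustration rather than the system's code: the function name `restore_labels` is hypothetical, the bin edges are taken from the earlier age example, and clamping out-of-range labels is an added assumption.

```python
import math


def restore_labels(synthetic, edges):
    """Round synthetic labels, drop NA values, and map labels to bin midpoints."""
    n_bins = len(edges) - 1
    restored = []
    for value in synthetic:
        if value is None or math.isnan(value):
            continue  # NA values are removed before restoration.
        # Note: Python's round() rounds exact .5 ties to the nearest even integer.
        label = int(round(value))
        # Assumption: labels outside 0 .. n_bins - 1 are clamped into range.
        label = max(0, min(label, n_bins - 1))
        restored.append((edges[label] + edges[label + 1]) / 2)
    return restored


edges = [18.0, 27.4, 36.8, 46.2, 55.6, 65.0]
synthetic = [0.2, 1.9, float("nan"), 3.6, None]
print([round(v, 1) for v in restore_labels(synthetic, edges)])  # [22.7, 41.5, 60.3]
```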

Related Documentation