Splitter YAML

YAML configuration format for the Splitter module.

Usage Examples

Click the below button to run this example in Colab:

Note: If using Colab, please see the runtime setup guide.

Basic Splitting

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Splitter:
  basic_split:
    num_samples: 3
    train_split_ratio: 0.8

Controlled Overlap

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Splitter:
  controlled_overlap:
    num_samples: 10
    train_split_ratio: 0.8
    max_overlap_ratio: 0.8  # Allow 80% overlap
    max_attempts: 500
    random_state: 42

No Overlap Splitting

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Splitter:
  no_overlap:
    num_samples: 2
    train_split_ratio: 0.01
    max_overlap_ratio: 0.0  # Completely non-overlapping
    max_attempts: 100000
    random_state: 42

⚠️

Note: This example demonstrates the no-overlap configuration. Since this package uses a sampling-then-comparison algorithm, achieving complete non-overlap (max_overlap_ratio: 0.0) is extremely difficult. This feature aims to provide sampling diversity. In practice, we recommend allowing minimal overlap (e.g., max_overlap_ratio: 0.8) to ensure execution efficiency.

Main Parameters

num_samples (integer, optional)
- Number of times to resample the data
- Default: 1
train_split_ratio (float, optional)
- Ratio of data for training set
- Default: 0.8

Parameter Details

Required Parameters

Splitter has no required parameters, all parameters are optional.

Optional Parameters

Parameter	Type	Default	Description	Example
`num_samples`	`integer`	`1`	Number of times to resample	`5`
`train_split_ratio`	`float`	`0.8`	Ratio of data for training set (0.0 to 1.0)	`0.7`
`random_state`	`integer\|string`	`null`	Seed for reproducibility	`42` or `"exp_v1"`
`max_overlap_ratio`	`float`	`1.0`	Maximum overlap ratio between samples (0.0 to 1.0)	`0.1`
`max_attempts`	`integer`	`30`	Maximum attempts for sampling with overlap control	`50`

Use Cases

The Splitter module is primarily designed to meet the requirements of the Evaluator module for splitting datasets into training and test sets. For detailed post-split evaluation configuration, please refer to the Evaluator YAML documentation.

Execution Notes

Multiple split experiments can be defined and will be executed sequentially
Split results are passed to the next module (e.g., Synthesizer)
Sample numbering starts from 1 (not 0)

Important Notes

Overly strict overlap constraints may cause sampling failures, adjust max_attempts accordingly
Set random_state for reproducible results

Loader YAML Preprocessor YAML