Splitter API
Data splitting module for creating training and validation sets with overlap control.
Class Architecture
classDiagram
    class Splitter {
        -config: dict
        +__init__(num_samples, train_split_ratio, random_state, max_overlap_ratio, max_attempts)
        +split(data, metadata, exist_train_indices) tuple[dict, dict, list[set]]
        +get_train_indices() list[set]
        -_bootstrapping(index, exist_train_indices) dict
        -_check_overlap_acceptable(new_train_sample, existing_train_sets) bool
        -_update_metadata_with_split_info(metadata, train_rows, validation_rows) Schema
    }
    class Schema {
        +id: str
        +attributes: dict
        +description: str
        +stats: TableStats
    }
    class DataFrame {
        <<pandas>>
        +index: Index
        +columns: Index
        +iloc: IndexingMixin
        +reset_index()
    }
    Splitter ..> Schema : uses
    Splitter ..> DataFrame : uses
Legend:
- Blue box: Main class
- Orange box: Subclass implementations
- Light purple box: Configuration and data classes
- <|--: Inheritance relationship
- *--: Composition relationship
- ..>: Dependency relationship
Basic Usage
from petsard import Splitter
# Basic splitting (df is an existing pandas DataFrame)
splitter = Splitter(num_samples=3, train_split_ratio=0.8)
split_data, metadata_dict, train_indices = splitter.split(data=df)
# Strict overlap control
splitter = Splitter(
    num_samples=5,
    train_split_ratio=0.7,
    max_overlap_ratio=0.1  # Maximum 10% overlap
)
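To make train_split_ratio concrete, the following is a minimal sketch of a single ratio-based split over row positions. It uses only the standard library and is not PETsARD's implementation: split_indices is a hypothetical helper, and rounding the training size with round() is an assumption.

```python
import random


def split_indices(n_rows, train_split_ratio=0.8, seed=None):
    """Hypothetical helper: partition row positions into train and validation sets."""
    if not 0.0 <= train_split_ratio <= 1.0:
        raise ValueError("train_split_ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)
    # Assumption: the training size is the rounded share of all rows.
    n_train = round(n_rows * train_split_ratio)
    # Draw training rows without replacement; the rest become validation rows.
    train = set(rng.sample(range(n_rows), n_train))
    validation = set(range(n_rows)) - train
    return train, validation


train, validation = split_indices(10, train_split_ratio=0.8, seed=42)
print(len(train), len(validation))  # 8 2
```

Every row lands in exactly one of the two sets, so the sets are disjoint and together cover the whole dataset.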
Constructor (__init__)
Initialize a data splitter instance.
Syntax
def __init__(
    num_samples: int = 1,
    train_split_ratio: float = 0.8,
    random_state: int | float | str = None,
    max_overlap_ratio: float = 1.0,
    max_attempts: int = 30
)
Parameters
num_samples : int, optional
- Number of times to resample the data
- Default: 1
- Must be a positive integer
train_split_ratio : float, optional
- Ratio of data assigned to the training set
- Default: 0.8
- Range: 0.0 to 1.0
random_state : int | float | str, optional
- Seed for reproducibility
- Default: None
- Can be an integer, float, or string
max_overlap_ratio : float, optional
- Maximum allowed overlap ratio between samples
- Default: 1.0 (100%, allows complete overlap)
- Range: 0.0 to 1.0
- Set to 0.0 for no overlap between samples
max_attempts : int, optional
- Maximum number of sampling attempts under overlap control
- Default: 30
- Used only when overlap control is active
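The interaction between max_overlap_ratio and max_attempts can be illustrated with a small rejection-sampling sketch. This is an assumption about how such a control could work, not the library's actual code: the helper names are hypothetical, the overlap ratio is assumed to be the share of a new training set that already appears in an earlier one, and raising RuntimeError when the attempts run out is likewise an assumption.

```python
import random


def overlap_ratio(new_set, existing_set):
    """Assumed definition: share of new_set that also appears in existing_set."""
    if not new_set:
        return 0.0
    return len(new_set & existing_set) / len(new_set)


def sample_train_sets(n_rows, num_samples, train_split_ratio,
                      max_overlap_ratio=1.0, max_attempts=30, seed=None):
    """Hypothetical sketch: draw training index sets, rejecting high-overlap draws."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_split_ratio)
    accepted = []
    for sample_no in range(1, num_samples + 1):
        for _ in range(max_attempts):
            candidate = set(rng.sample(range(n_rows), n_train))
            # Accept only if the candidate is acceptable against every earlier set.
            if all(overlap_ratio(candidate, prev) <= max_overlap_ratio
                   for prev in accepted):
                accepted.append(candidate)
                break
        else:
            # No acceptable candidate was found within max_attempts draws.
            raise RuntimeError(
                f"sample {sample_no}: no acceptable draw in {max_attempts} attempts"
            )
    return accepted


train_sets = sample_train_sets(
    n_rows=100, num_samples=3, train_split_ratio=0.2,
    max_overlap_ratio=0.5, seed=0,
)
print([len(s) for s in train_sets])
```

Under this assumed definition, fully disjoint training sets (max_overlap_ratio=0.0) are only possible when num_samples × train_split_ratio does not exceed 1.0. Stricter limits generally need a smaller training ratio, fewer samples, or a larger max_attempts.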
Returns
- Splitter
- Initialized splitter instance
Examples
from petsard import Splitter
# Basic splitter with default settings
splitter = Splitter()
# Multiple samples with reproducibility
splitter = Splitter(
    num_samples=5,
    train_split_ratio=0.8,
    random_state=42
)
# Strict overlap control
splitter = Splitter(
    num_samples=3,
    max_overlap_ratio=0.1,  # Max 10% overlap
    max_attempts=50
)
# No overlap between samples
splitter = Splitter(
    num_samples=5,
    max_overlap_ratio=0.0,  # Completely non-overlapping
    random_state="experiment_v1"
)
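The reproducibility that random_state is meant to provide can be demonstrated with Python's standard random module, which also accepts integer and string seeds. How Splitter converts its seed internally is not documented here, so this sketch only illustrates the general behaviour: draw is a hypothetical helper, not part of the library.

```python
import random


def draw(seed, n_rows=100, k=5):
    """Hypothetical helper: draw k row positions using a seeded generator."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)


# The same seed always reproduces the same draw, whether it is an int or a str.
assert draw(42) == draw(42)
assert draw("experiment_v1") == draw("experiment_v1")

print(draw("experiment_v1"))
```

A descriptive string seed such as "experiment_v1" can double as a label for the experiment it belongs to.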
Notes
- The functional API returns tuples directly from the split() method
- Uses functional programming patterns with immutable data structures
- For detailed split method usage, see the split() documentation
- YAML configuration is recommended for complex experiments
- Bootstrap sampling is used internally for generating multiple samples