split()
Splits data into training and validation sets with enhanced overlap control.
Syntax
def split(
    data: pd.DataFrame = None,
    metadata: SchemaMetadata = None,
    exist_train_indices: list[set] = None
) -> tuple[dict, dict, list[set]]
Parameters
data : pd.DataFrame, required
- The dataset to split
- Must be a pandas DataFrame
- Cannot be None
metadata : SchemaMetadata, optional
- Metadata for the dataset
- Contains schema information about the data
- Will be updated with split information
- Default: None
exist_train_indices : list[set], optional
- List of existing training index sets to avoid overlap with
- Each set contains training indices from a previous split
- Used to ensure new splits don’t overlap with existing ones
- Default: None
Returns
- tuple[dict, dict, list[set]]
- A tuple containing three elements:
  - split_data (dict): Dictionary of format {sample_num: {'train': df, 'validation': df}}
  - metadata_dict (dict): Dictionary of format {sample_num: {'train': metadata, 'validation': metadata}}
  - train_indices (list[set]): List of training index sets, one per sample
Description
The split() method performs data splitting using functional programming patterns with enhanced overlap control. It generates multiple train/validation splits based on the configuration provided during initialization.
The method performs the following operations:
- Validates input data
- Generates training indices based on overlap control settings
- Creates training and validation DataFrames
- Updates metadata if provided
- Returns results as immutable data structures
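The index-generation step listed above can be pictured as rejection sampling. The following is a minimal sketch in plain Python, not the library's actual implementation. The helper name `sample_train_indices`, the definition of overlap ratio as the share of a candidate's indices already present in an earlier training set, and the choice of `RuntimeError` are all assumptions made for illustration.

```python
import random


def sample_train_indices(n_rows, train_size, num_samples,
                         max_overlap_ratio=1.0, max_attempts=30,
                         exist_train_indices=None, seed=None):
    """Hypothetical helper: draw training index sets by rejection sampling."""
    rng = random.Random(seed)
    # Earlier training sets that every new candidate is compared against.
    accepted = [set(s) for s in (exist_train_indices or [])]
    new_sets = []
    for _ in range(num_samples):
        for _attempt in range(max_attempts):
            candidate = set(rng.sample(range(n_rows), train_size))
            # Assumed definition: share of the candidate found in an earlier set.
            worst = max(
                (len(candidate & prev) / train_size for prev in accepted),
                default=0.0,
            )
            if worst <= max_overlap_ratio:
                accepted.append(candidate)
                new_sets.append(candidate)
                break
        else:
            # No candidate passed the overlap check within max_attempts.
            raise RuntimeError("no valid sample found within max_attempts")
    return new_sets


# Two training sets of 40 rows out of 100, sharing at most half their indices.
sets = sample_train_indices(100, 40, num_samples=2,
                            max_overlap_ratio=0.5, seed=0)
```

Under this assumed definition, a strict limit such as 0.0 is only satisfiable when the requested training sets can fit side by side in the data, which is why an attempt limit and an exception are needed at all.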
Example
from petsard import Splitter
import pandas as pd
# Create sample data
df = pd.DataFrame({
    'feature1': range(100),
    'feature2': range(100, 200),
    'target': [0, 1] * 50
})
# Basic split
splitter = Splitter(num_samples=3, train_split_ratio=0.8)
split_data, metadata_dict, train_indices = splitter.split(data=df)
# Access results
for i in range(1, 4):  # sample numbers start at 1
    train_df = split_data[i]['train']
    val_df = split_data[i]['validation']
    print(f"Sample {i}: Train={len(train_df)}, Val={len(val_df)}")
Notes
- This method follows functional programming principles, returning immutable data structures
- Sample numbers start from 1, not 0
- When max_overlap_ratio is set to 0.0, samples will have no overlap
- If the method cannot generate valid samples within max_attempts, it will raise an exception
- Metadata is optional but recommended for maintaining data lineage
- Returned DataFrames are copies, not references to the original data
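To confirm the overlap behaviour described in these notes, the returned train_indices list can be inspected directly. The sketch below uses a hypothetical helper, `max_pairwise_overlap`, and assumes that overlap is measured as the number of shared indices divided by the size of the smaller set. The library may define the ratio differently.

```python
from itertools import combinations


def max_pairwise_overlap(train_indices):
    """Return the largest share of indices that any two training sets have in common."""
    worst = 0.0
    for a, b in combinations(train_indices, 2):
        smaller = min(len(a), len(b))
        if smaller:
            worst = max(worst, len(a & b) / smaller)
    return worst


# Stand-in for the train_indices list returned by split().
train_indices = [{0, 1, 2, 3}, {2, 3, 4, 5}, {6, 7, 8, 9}]
print(max_pairwise_overlap(train_indices))  # prints 0.5
```

A result of 0.0 would mean that no two training sets share any index, which is what the max_overlap_ratio note above describes for a setting of 0.0.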