resample_until_satisfy()

Repeatedly resample until constraints are satisfied and target row count is reached.

Syntax

def resample_until_satisfy(
    data: pd.DataFrame,
    target_rows: int,
    synthesizer,
    postprocessor=None,
    max_trials: int = 300,
    sampling_ratio: float = 10.0,
    verbose_step: int = 10
) -> pd.DataFrame

Parameters

data : pd.DataFrame, required
- Input dataframe to apply constraints to
- Required parameter
- If partially constrained data already exists, it serves as the base for further supplementation
target_rows : int, required
- Target number of rows
- Required parameter
- Final returned dataframe will contain this number of rows
synthesizer : Synthesizer, required
- Synthesizer instance for generating synthetic data
- Required parameter
- Must be a synthesizer already trained via \1
postprocessor : Postprocessor, optional
- Postprocessor for data transformation
- Used to convert synthetic data back to original format
- Default value: None
max_trials : int, optional
- Maximum number of attempts
- Stops even if target row count not met after reaching this number
- Default value: 300
sampling_ratio : float, optional
- Multiple of target rows to generate each time
- Used to compensate for data loss from constraint filtering
- Default value: 10.0 (generate 10 times the data)
verbose_step : int, optional
- Display progress every N attempts
- Set to 0 to disable progress display
- Default value: 10

Return Value

pd.DataFrame
- Dataframe satisfying all constraints with target row count
- If max_trials reached without satisfaction, returns collected data (may be less than target_rows)

Attributes

The following attribute is set after execution:

resample_trails : int
- Number of attempts required to reach target
- Can be used to evaluate constraint strictness

Description

The \1 method is suitable for situations with strict constraints where filtered data is insufficient. It will:

First apply constraints to input data
Calculate amount of data needed for supplementation
Iteratively:
- Generate new synthetic data using synthesizer
- Apply postprocessor (if present)
- Apply all constraint conditions
- Accumulate data meeting conditions
- Check if target row count is reached
After reaching target, randomly sample target number of rows

Resampling Flow

Start
  ↓
Apply constraints to input data
  ↓
Sufficient? ──Yes──> Random sample target_rows rows ──> Complete
  ↓ No
Start iteration (trials = 0)
  ↓
Generate target_rows × sampling_ratio rows of synthetic data
  ↓
Apply postprocessor (if present)
  ↓
Apply all constraint conditions
  ↓
Accumulate data meeting conditions
  ↓
trials += 1
  ↓
Data sufficient? ──Yes──> Random sample target_rows rows ──> Complete
  ↓ No
trials < max_trials? ──Yes──> Back to "Generate synthetic data"
  ↓ No
Return collected data (warning)

Basic Examples

Simple Resampling

from petsard import Constrainer, Synthesizer
import pandas as pd

# Prepare original data
df = pd.DataFrame({
    'age': [25, 30, 45, 55, 60],
    'salary': [50000, 60000, 80000, 90000, 95000]
})

# Train synthesizer
synthesizer = Synthesizer(method='default')
synthesizer.create(metadata=schema)
synthesizer.fit(df)

# Configure strict constraints
config = {
    'field_constraints': [
        "age >= 25 & age <= 50",  # Limit age range
        "salary >= 60000 & salary <= 85000"  # Limit salary range
    ]
}

constrainer = Constrainer(config)

# Resample until reaching 100 rows
result = constrainer.resample_until_satisfy(
    data=pd.DataFrame(),  # Start from empty
    target_rows=100,
    synthesizer=synthesizer,
    max_trials=50,
    sampling_ratio=20.0  # Generate 2000 rows each time
)

print(f"Target rows: 100")
print(f"Actual rows: {len(result)}")
print(f"Attempts: {constrainer.resample_trails}")
# Target rows: 100
# Actual rows: 100
# Attempts: 3

Using Postprocessor

from petsard import Constrainer, Synthesizer, Preprocessor, Postprocessor
import pandas as pd

# Prepare and preprocess data
df = pd.DataFrame({
    'age': [25, 30, 45, 55],
    'category': ['A', 'B', 'A', 'C']
})

preprocessor = Preprocessor('default')
processed_data = preprocessor.fit_transform(df)

# Train synthesizer
synthesizer = Synthesizer(method='default')
synthesizer.create(metadata=schema)
synthesizer.fit(processed_data)

# Create postprocessor
postprocessor = Postprocessor('default')
postprocessor.fit(df)  # Train with original data

# Configure constraints
config = {
    'field_constraints': [
        "age >= 20 & age <= 50"
    ],
    'field_proportions': [
        {'fields': 'category', 'mode': 'all', 'tolerance': 0.1}
    ]
}

constrainer = Constrainer(config)

# Resample (with postprocessing)
result = constrainer.resample_until_satisfy(
    data=pd.DataFrame(),
    target_rows=200,
    synthesizer=synthesizer,
    postprocessor=postprocessor,  # Convert encoded data back to original format
    max_trials=100,
    verbose_step=20  # Show progress every 20 attempts
)

print(f"Final row count: {len(result)}")
print(f"Attempts: {constrainer.resample_trails}")
print("\nCategory distribution:")
print(result['category'].value_counts())

Advanced Examples

Expanding from Existing Data

from petsard import Constrainer, Synthesizer
import pandas as pd

# Already have partial data meeting constraints
existing_data = pd.DataFrame({
    'age': [28, 35, 42],
    'performance': [5, 5, 4],
    'education': ['PhD', 'Master', 'PhD']
})

# Configure constraints
config = {
    'field_constraints': [
        "age >= 25 & age <= 50",
        "performance >= 4"
    ],
    'field_combinations': [
        (
            {'education': 'performance'},
            {'PhD': [4, 5], 'Master': [4, 5]}
        )
    ]
}

constrainer = Constrainer(config)

# Expand from existing data to 100 rows
result = constrainer.resample_until_satisfy(
    data=existing_data,  # Use existing data as base
    target_rows=100,
    synthesizer=synthesizer,
    max_trials=50
)

print(f"Original data: {len(existing_data)} rows")
print(f"Final data: {len(result)} rows")
print(f"Added data: {len(result) - len(existing_data)} rows")

Monitoring Resampling Process

from petsard import Constrainer, Synthesizer
import pandas as pd

# Configure very strict constraints
config = {
    'field_constraints': [
        "age >= 30 & age <= 35",  # Very narrow range
        "salary >= 70000 & salary <= 75000",
        "performance == 5"  # Must be highest score
    ],
    'field_combinations': [
        (
            {'education': 'salary'},
            {'PhD': [70000, 75000]}  # Only allow PhD
        )
    ]
}

constrainer = Constrainer(config)

# Resample and monitor process
print("Starting resampling...")
result = constrainer.resample_until_satisfy(
    data=pd.DataFrame(),
    target_rows=50,
    synthesizer=synthesizer,
    max_trials=200,
    sampling_ratio=50.0,  # Increase sampling ratio due to strict constraints
    verbose_step=10  # Show progress every 10 attempts
)
# Trial 10: Got 15 rows, need 35 more
# Trial 20: Got 28 rows, need 22 more
# Trial 30: Got 41 rows, need 9 more
# Trial 40: Got 50 rows, need 0 more

print(f"\nComplete!")
print(f"Target rows: 50")
print(f"Actual rows: {len(result)}")
print(f"Total attempts: {constrainer.resample_trails}")
print(f"Average valid data per attempt: {len(result) / constrainer.resample_trails:.2f} rows")

Handling Sampling Failure

from petsard import Constrainer, Synthesizer
import pandas as pd

# Configure nearly impossible-to-satisfy constraints
config = {
    'field_constraints': [
        "age == 25 & salary == 100000"  # Extremely specific condition
    ]
}

constrainer = Constrainer(config)

# Attempt resampling
result = constrainer.resample_until_satisfy(
    data=pd.DataFrame(),
    target_rows=100,
    synthesizer=synthesizer,
    max_trials=50,
    sampling_ratio=100.0,
    verbose_step=10
)

if len(result) < 100:
    print(f"Warning: Only collected {len(result)} rows (target 100 rows)")
    print(f"Attempts reached limit: {constrainer.resample_trails}")
    print("Suggestion: Relax constraints or increase max_trials/sampling_ratio")
else:
    print(f"Successfully collected {len(result)} rows")

Optimizing Sampling Parameters

from petsard import Constrainer, Synthesizer
import pandas as pd
import time

config = {
    'field_constraints': [
        "age >= 25 & age <= 45"
    ]
}

constrainer = Constrainer(config)

# Test different sampling_ratio values
for ratio in [5.0, 10.0, 20.0, 50.0]:
    start_time = time.time()

    result = constrainer.resample_until_satisfy(
        data=pd.DataFrame(),
        target_rows=100,
        synthesizer=synthesizer,
        sampling_ratio=ratio,
        verbose_step=0  # Disable progress display
    )

    elapsed = time.time() - start_time

    print(f"Sampling Ratio: {ratio}")
    print(f"  Attempts: {constrainer.resample_trails}")
    print(f"  Execution time: {elapsed:.2f} seconds")
    print(f"  Success rate: {len(result) / (constrainer.resample_trails * 100 * ratio) * 100:.2f}%")
    print()

Important Notes

Synthesizer State: synthesizer must already be trained via \1
Data Accumulation: Automatically removes duplicate rows to ensure data diversity
Memory Usage: Large sampling_ratio and multiple iterations consume more memory
Parameter Tuning:
- For strict constraints, increase sampling_ratio
- For poor synthesizer quality, increase max_trials
- For quick testing, reduce target_rows
Failure Handling: Reaching max_trials issues warning but still returns collected data
Randomness: Final sampling uses fixed seed (random_state=42) to ensure reproducibility
Performance Considerations: Each iteration fully applies all constraints; may be slow for large datasets
Progress Display: Set verbose_step=0 to disable progress output
Proportion Maintenance: field_proportions automatically uses target_rows as target
Initial Data: Providing initial data can accelerate collection process

Related Methods

\1: Initialize constraint configuration
\1: Single application of constraints
\1: Register custom constraint types

validate()register()