apply()
Apply configured constraints to input data.
Syntax
```python
def apply(
    df: pd.DataFrame,
    target_rows: int = None
) -> pd.DataFrame
```
Parameters
df : pd.DataFrame, required
- Input dataframe to apply constraints to
- Required parameter
target_rows : int, optional
- Target number of rows
- Used internally by `resample_until_satisfy()`
- Used as the target row count for field proportion constraints
- Default value: `None`
Return Value
- pd.DataFrame
- Dataframe after applying all constraint conditions
- May have fewer rows than the input dataframe (due to constraint filtering)
Description
The `apply()` method applies all configured constraint conditions sequentially, in the following order:
- NaN Groups (`nan_groups`): Handle null-value-related rules
- Field Constraints (`field_constraints`): Check field value domain constraints
- Field Combinations (`field_combinations`): Validate field combination rules
- Field Proportions (`field_proportions`): Maintain field proportion distributions
Each stage filters out rows that do not meet its conditions, ultimately returning data that satisfies all constraints simultaneously.
Constraint Application Flow
```
Input Data (N rows)
    ↓
NaN Groups Processing (delete/erase/copy)
    ↓
Field Constraints Filtering (value domain checks)
    ↓
Field Combinations Filtering (combination rules)
    ↓
Field Proportions Filtering (proportion maintenance)
    ↓
Output Data (≤N rows)
```
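To see how each stage narrows the data, one option is to apply cumulative subsets of the configuration through separate Constrainer instances. The following is a minimal sketch with illustrative data, assuming each partial configuration is valid on its own:

```python
from petsard import Constrainer
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'age': [25, 15, 45, 70, 35],
    'performance': [5, 3, 2, 2, 5],
})

# Apply the constraint groups cumulatively to see where rows are dropped
stages = {
    'age only': {'field_constraints': ["age >= 18 & age <= 65"]},
    'age + performance': {'field_constraints': ["age >= 18 & age <= 65",
                                                "performance >= 4"]},
}

for name, config in stages.items():
    kept = Constrainer(config).apply(df)
    print(f"{name}: {len(df)} -> {len(kept)} rows")
# Expected under the documented filtering semantics:
# age only: 5 -> 3 rows
# age + performance: 5 -> 2 rows
```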
Basic Examples
Simple Constraint Application
```python
from petsard import Constrainer
import pandas as pd

# Prepare data
df = pd.DataFrame({
    'age': [25, 15, 45, 70, 35],
    'performance': [5, 3, 4, 2, 5],
    'education': ['PhD', 'Bachelor', 'Master', 'Bachelor', 'PhD']
})

# Configure constraints
config = {
    'field_constraints': [
        "age >= 18 & age <= 65",
        "performance >= 4"
    ]
}

# Apply constraints
constrainer = Constrainer(config)
result = constrainer.apply(df)

print(f"Original rows: {len(df)}")
print(f"After constraints: {len(result)}")
# Original rows: 5
# After constraints: 3 (rows with ages 25, 45, and 35 satisfy both conditions)
```
Multiple Constraint Application
```python
from petsard import Constrainer
import pandas as pd

# Prepare data
df = pd.DataFrame({
    'name': ['Alice', None, 'Charlie', 'David'],
    'age': [25, 30, 45, 55],
    'salary': [50000, 60000, 80000, 90000],
    'education': ['Bachelor', 'Master', 'PhD', 'Master']
})

# Configure multiple constraints
config = {
    'nan_groups': {
        'name': 'delete'  # Delete rows where name is null
    },
    'field_constraints': [
        "age >= 20 & age <= 50"  # Age restriction
    ],
    'field_combinations': [
        (
            {'education': 'salary'},
            {
                'PhD': [70000, 80000, 90000],    # PhD salary range
                'Master': [50000, 60000, 70000]  # Master salary range
            }
        )
    ]
}

constrainer = Constrainer(config)
result = constrainer.apply(df)

print("Data after applying constraints:")
print(result)
# Only retains rows that simultaneously satisfy:
# 1. name is not null
# 2. age is between 20 and 50
# 3. the education-salary combination follows the rules
```
Advanced Examples
With Field Proportion Constraints
```python
from petsard import Constrainer
import pandas as pd

# Prepare data (imbalanced categories)
df = pd.DataFrame({
    'category': ['A'] * 80 + ['B'] * 15 + ['C'] * 5,
    'value': range(100)
})

print("Original distribution:")
print(df['category'].value_counts())
# A    80
# B    15
# C     5

# Configure field proportion constraints
config = {
    'field_proportions': [
        {
            'fields': 'category',
            'mode': 'all',
            'tolerance': 0.1  # Allow 10% deviation
        }
    ]
}

constrainer = Constrainer(config)

# Note: target_rows is usually set automatically by resample_until_satisfy;
# it is set manually here for demonstration
result = constrainer.apply(df, target_rows=50)

print("\nDistribution after constraints:")
print(result['category'].value_counts())
# Maintains the original proportions (80:15:5) with a total of roughly 50 rows
```
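One way to check the result against the configured tolerance is to compare the normalized category distributions before and after. This is a sketch of a manual check, not part of the petsard API; the 10% threshold mirrors the tolerance configured above:

```python
from petsard import Constrainer
import pandas as pd

df = pd.DataFrame({
    'category': ['A'] * 80 + ['B'] * 15 + ['C'] * 5,
    'value': range(100)
})
config = {
    'field_proportions': [
        {'fields': 'category', 'mode': 'all', 'tolerance': 0.1}
    ]
}

result = Constrainer(config).apply(df, target_rows=50)

# Compare category shares before and after; the absolute deviation should
# stay within the configured tolerance if proportions are maintained.
original = df['category'].value_counts(normalize=True)
constrained = (result['category'].value_counts(normalize=True)
               .reindex(original.index, fill_value=0))
deviation = (original - constrained).abs()
print(deviation)
print("Within tolerance:", bool((deviation <= 0.1).all()))
```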
Complex Condition Combinations
```python
from petsard import Constrainer
import pandas as pd

# Prepare employee data
df = pd.DataFrame({
    'workclass': ['Private', None, 'Government', 'Private', 'Self-emp'],
    'occupation': ['Manager', 'Sales', None, 'Tech', 'Manager'],
    'age': [35, 28, 45, 22, 50],
    'hours_per_week': [40, 35, 50, 65, 38],
    'income': ['>50K', '<=50K', '>50K', '<=50K', '>50K'],
    'education': ['Master', 'Bachelor', 'PhD', 'Bachelor', 'Master']
})

config = {
    'nan_groups': {
        'workclass': 'delete',  # Delete rows where workclass is null
        'occupation': {
            'erase': ['income']  # Clear income when occupation is null
        }
    },
    'field_constraints': [
        "age >= 18 & age <= 65",
        "hours_per_week >= 20 & hours_per_week <= 60",
        "(education == 'PhD' & income == '>50K') | education != 'PhD'"
    ],
    'field_combinations': [
        (
            {'education': 'income'},
            {
                'PhD': ['>50K'],             # PhD must have high income
                'Master': ['>50K', '<=50K']  # Master can be high or low
            }
        )
    ]
}

constrainer = Constrainer(config)
result = constrainer.apply(df)

print(f"Original data: {len(df)} rows")
print(f"After constraints: {len(result)} rows")
print("\nData after constraints:")
print(result)
```
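The 'erase' rule above blanks out income whenever occupation is missing. Below is a quick check of that behavior on a smaller sample, a sketch under the assumption that erased values are stored as NaN as described:

```python
from petsard import Constrainer
import pandas as pd

df = pd.DataFrame({
    'workclass': ['Private', None, 'Government'],
    'occupation': ['Manager', 'Sales', None],
    'income': ['>50K', '<=50K', '>50K'],
})
config = {
    'nan_groups': {
        'workclass': 'delete',               # drop rows with missing workclass
        'occupation': {'erase': ['income']}  # blank out income when occupation is missing
    }
}

result = Constrainer(config).apply(df)

# Expected under the behavior described above: the row with a missing
# workclass is removed, and income is NaN wherever occupation is missing.
print(result)
print("Income erased where occupation is null:",
      result.loc[result['occupation'].isna(), 'income'].isna().all())
```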
Important Notes
- Data Copy: The method works on a copy of the input dataframe and does not modify the original data
- Order Matters: Constraints are applied in a fixed order that cannot be adjusted
- Data Reduction: Constraints typically filter data, so the returned row count may be significantly smaller
- AND Logic: All constraints are combined with AND; a row must satisfy all of them to be retained
- target_rows: General users do not need to set this parameter manually; it is used internally by `resample_until_satisfy()`
- Empty Results: If constraints are too strict, the method may return an empty dataframe
- Performance Considerations: Complex constraints on large datasets may take longer to execute
- Field Proportions: Proportion maintenance only occurs when `field_proportions` is configured and `target_rows` is provided
- Validation Recommendation: Test constraint reasonableness on a small sample before applying it to the full dataset (see the sketch after this list)
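Regarding the last two notes, here is a minimal sketch of dry-running a configuration on a small sample and guarding against an empty result; the data, sample size, and deliberately strict constraints are illustrative assumptions:

```python
from petsard import Constrainer
import pandas as pd

full_df = pd.DataFrame({
    'age': range(18, 80),
    'score': [i % 10 for i in range(62)],
})
config = {'field_constraints': ["age >= 21 & age <= 30", "score >= 8"]}

# Try the constraints on a small sample first to gauge how strict they are.
sample = full_df.sample(n=20, random_state=42)
sample_result = Constrainer(config).apply(sample)
print(f"Sample survival rate: {len(sample_result) / len(sample):.0%}")

# Guard against an empty result before using it downstream.
result = Constrainer(config).apply(full_df)
if result.empty:
    print("Constraints filtered out every row; consider relaxing them.")
else:
    print(f"Kept {len(result)} of {len(full_df)} rows")
```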
Related Methods
- `Constrainer()`: Initialize constraint configuration
- `resample_until_satisfy()`: Resample repeatedly until constraints are satisfied
- `register()`: Register custom constraint types