inverse_transform()
Performs postprocessing restoration, converting processed data back to a format close to the original.
Syntax
def inverse_transform(
data: pd.DataFrame
) -> pd.DataFrame
Parameters
- data : pd.DataFrame, required
- Dataset to be restored (typically synthetic data)
- Required parameter
- Should be data that went through the same preprocessing steps
Returns
- pd.DataFrame
- Restored data
- All inverse transformation steps applied
- Data format close to original
Description
The inverse_transform() method performs inverse data transformation. This method will:
- Execute restoration operations in reverse order of preprocessing sequence
- Sequentially apply inverse scaling, inverse encoding, restore missing values, etc.
- Align data types to original schema
- Return data in format close to original
This method must be called after fit() and transform().
Basic Example
from petsard import Loader, Processor, Synthesizer
# 1. Load and preprocess data
loader = Loader('data.csv', schema='schema.yaml')
data, schema = loader.load()
processor = Processor(metadata=schema)
processor.fit(data)
processed_data = processor.transform(data)
# 2. Synthesize data
synthesizer = Synthesizer(method='default')
synthesizer.create(metadata=schema)
synthesizer.fit_sample(processed_data)
synthetic_data = synthesizer.data_syn
# 3. Postprocess restoration
restored_data = processor.inverse_transform(synthetic_data)
print(f"Original data shape: {data.shape}")
print(f"Restored data shape: {restored_data.shape}")
Complete Workflow
from petsard import Loader, Processor, Synthesizer
import pandas as pd
# Load data
loader = Loader('data.csv', schema='schema.yaml')
data, schema = loader.load()
# Preprocessing
processor = Processor(metadata=schema)
processor.fit(data, sequence=['missing', 'outlier', 'encoder', 'scaler'])
processed_data = processor.transform(data)
# Synthesis
synthesizer = Synthesizer(method='default')
synthesizer.create(metadata=schema)
synthesizer.fit_sample(processed_data, sample_num_rows=len(data))
synthetic_data = synthesizer.data_syn
# Postprocessing restoration
# Restoration sequence automatically becomes: ['scaler', 'encoder', 'missing']
restored_data = processor.inverse_transform(synthetic_data)
# Compare data distributions
print("Original data descriptive statistics:")
print(data.describe())
print("\nRestored data descriptive statistics:")
print(restored_data.describe())
Restoration Workflow
Start
↓
Check if trained
↓
Set missing value restoration parameters
↓
Reverse preprocessing sequence (remove outlier)
↓
Execute restoration in reverse order:
4. Scaler: Inverse scale to original range
↓ (Mediator adjusts columns)
3. Encoder: Decode categorical variables
↓ (Mediator adjusts columns)
1. Missing: Insert NA proportionally
↓
Align data types to schema
↓
Return restored data
↓
End
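The sequence reversal in the workflow above can be sketched in a few lines. This is only an illustration: `restoration_sequence` and its `skipped` argument are hypothetical names, not part of the PETsARD API, and the real method does more (Mediator adjustments, type alignment).

```python
def restoration_sequence(preprocess_sequence, skipped=("outlier",)):
    """Reverse the preprocessing order and drop steps that are not restored."""
    return [step for step in reversed(preprocess_sequence) if step not in skipped]


# Same preprocessing sequence as in the Complete Workflow example
sequence = restoration_sequence(["missing", "outlier", "encoder", "scaler"])
print(sequence)  # ['scaler', 'encoder', 'missing']
```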
Restoration Steps Explained
1. Inverse Scaling
- Restore normalized values to original range
- Use scaling parameters learned during preprocessing
- Examples: inverse standardization, inverse min-max scaling
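As a rough illustration of what inverse scaling means, the sketch below applies and then undoes standardization and min-max scaling with NumPy. It assumes the scaling parameters (mean, standard deviation, minimum, maximum) were stored during preprocessing; it is not the library's own scaler code.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization: parameters learned during preprocessing
mean, std = values.mean(), values.std()
standardized = (values - mean) / std
restored_std = standardized * std + mean  # inverse standardization

# Min-max scaling: parameters learned during preprocessing
lo, hi = values.min(), values.max()
scaled = (values - lo) / (hi - lo)
restored_minmax = scaled * (hi - lo) + lo  # inverse min-max scaling
```

Both restored arrays match the original values up to floating point error.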
2. Inverse Encoding
- Convert numerical values back to categorical labels
- One-Hot encoding restored to single column
- Use mapping table created during preprocessing
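The way One-Hot columns collapse back into a single column can be sketched with plain pandas. The `color` column and its prefix are made-up illustration data, and the decoding shown here (pick the column with the largest value, then strip the prefix) is one possible approach, not necessarily the one the library uses.

```python
import pandas as pd

original = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Preprocessing direction: one column becomes three indicator columns
encoded = pd.get_dummies(original["color"], prefix="color")

# Restoration direction: take the active indicator column per row
# and strip the prefix to recover the label
prefix = "color_"
labels = encoded.astype(int).idxmax(axis=1).str[len(prefix):]
restored = labels.rename("color").to_frame()
```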
3. Restore Missing Values
- According to missing value ratio in original data
- Randomly select positions to insert NA values
- Calculate missing ratio independently for each column
4. Align Data Types
- Adjust data types according to schema definition
- Ensure categorical, numerical, datetime types are correct
- Handle special datetime formats
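A minimal sketch of type alignment with pandas follows. The column names, the `target_types` mapping, and the date format are assumptions made for the example; in practice the target types come from the schema.

```python
import pandas as pd

restored = pd.DataFrame({
    "age": [34.0, 51.0, 27.0],          # floats coming out of the synthesizer
    "gender": ["F", "M", "F"],
    "signup": ["2023-01-15", "2023-02-20", "2023-03-05"],
})

# Hypothetical target types standing in for the schema definition
target_types = {"age": "int64", "gender": "category"}
aligned = restored.astype(target_types)

# Datetime columns are parsed separately with an explicit format
aligned["signup"] = pd.to_datetime(aligned["signup"], format="%Y-%m-%d")
```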
Missing Value Restoration Mechanism
# System automatically calculates and restores missing values
# Assume original data has 10% missing values
# During preprocessing, record:
# - Global missing value ratio: 10%
# - age column missing value ratio: 15%
# - income column missing value ratio: 5%
# During postprocessing, restore:
# 1. Randomly select 10% of data rows
# 2. In age column, set 15% of these rows to NA
# 3. In income column, set 5% of these rows to NA
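The per-column part of this mechanism can be sketched as below. This is a simplified illustration that applies each column's missing ratio directly to all rows; it does not reproduce the row pre-selection step, and `insert_missing` and its parameters are hypothetical names, not library functions.

```python
import numpy as np
import pandas as pd


def insert_missing(df, ratios, seed=0):
    """Return a copy of df with NA inserted at random positions per column."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # the input frame is left untouched
    for column, ratio in ratios.items():
        n_missing = int(round(ratio * len(out)))
        positions = rng.choice(len(out), size=n_missing, replace=False)
        out.loc[out.index[positions], column] = np.nan
    return out


synthetic = pd.DataFrame({
    "age": np.arange(20, 120, dtype=float),
    "income": np.linspace(30000, 90000, 100),
})
restored = insert_missing(synthetic, {"age": 0.15, "income": 0.05})
print(restored.isna().mean())
```

Because the positions are drawn at random, the missing cells will not line up with those in the original data; only the per-column proportions are preserved.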
Notes
- Must complete fit() and transform() first
- Input data should be data processed with same preprocessing
- Outlier processing is not restored (step is skipped)
- Missing value positions are random, not exactly same as original data
- One-Hot encoding reduces number of columns
- Restored data types align to original schema
- Returns a copy of data, does not modify input
- Datetime data converted to appropriate format
- Floating point numbers from some synthesizers are rounded to integers (for discretizing cases)