Data Synthesis with Default Parameters
This is the simplest way to generate privacy-enhanced synthetic data. The current default synthesis method is the Gaussian Copula from SDV.
Loader:
  load_csv:
    filepath: benchmark/adult-income.csv
Preprocessor:
  default:
    method: 'default'
Synthesizer:
  default:
    method: 'default'
Postprocessor:
  default:
    method: 'default'
Reporter:
  output:
    method: 'save_data'
    source: 'Synthesizer'
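The configuration is a three-level mapping: module, then experiment name, then parameters. As a minimal sketch of that shape, the same settings can be written as a plain Python dict and walked in declaration order. The helper name `list_experiments` is hypothetical and is not part of the library.

```python
# The example configuration expressed as a plain Python dict
# (module -> experiment name -> parameters).
CONFIG = {
    "Loader": {"load_csv": {"filepath": "benchmark/adult-income.csv"}},
    "Preprocessor": {"default": {"method": "default"}},
    "Synthesizer": {"default": {"method": "default"}},
    "Postprocessor": {"default": {"method": "default"}},
    "Reporter": {"output": {"method": "save_data", "source": "Synthesizer"}},
}


def list_experiments(config):
    """Return (module, experiment, parameters) triples in declaration order."""
    return [
        (module, experiment, params)
        for module, experiments in config.items()
        for experiment, params in experiments.items()
    ]


for module, experiment, params in list_experiments(CONFIG):
    print(f"{module}[{experiment}]: {params}")
```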
YAML Parameters Detailed Explanation
Loader (Data Loading Module)
- load_csv: Experiment name, can be freely named; descriptive names are recommended
  - filepath: Data file path
    - Value: benchmark/adult-income.csv
    - Description: Specifies the location of the data file to load. This example uses adult-income.csv, which you can replace with the path to your own CSV file
    - Supported formats: CSV, TSV, Excel (requires openpyxl), OpenDocument
    - Supports relative or absolute paths
    - Also supports the benchmark:// protocol for automatic standard dataset downloads
Recommended: Use Schema:
To ensure data loading accuracy and consistency, it is highly recommended to use the schema parameter to pre-define the data structure. A schema lets you explicitly specify the data type of each column (numeric, categorical, datetime, etc.), constraints, and relationships between columns.
Example with schema:
Loader:
  load_csv:
    filepath: benchmark/adult-income.csv
    schema: benchmark/adult-income_schema.yaml
For detailed information about Schema, please refer to the Schema YAML Documentation.
Preprocessor (Data Preprocessing Module)
- default: Experiment name, can be freely named
  - method: Preprocessing method
    - Value: default
    - Description: Uses the default processing sequence, including the following steps:
      - missing (missing value handling): Numeric columns use mean imputation, categorical columns are dropped
      - outlier (outlier handling): Numeric columns use IQR method
      - encoder (encoding): Categorical columns use uniform encoding (encoder_uniform)
      - scaler (scaling): Numeric columns use standardization (scaler_standard)
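The four default steps can be illustrated on toy columns with the standard library alone. This is a rough sketch, not the library's implementation. It assumes the common 1.5 × IQR fence with out-of-range values removed, and it assumes uniform encoding assigns each category a slice of [0, 1) sized by its frequency. All function names and the sample values are hypothetical, and the sketch works per column, whereas a real pipeline operates on whole rows.

```python
import random
import statistics
from collections import Counter


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def uniform_encode(labels, seed=0):
    """Give each category a slice of [0, 1) sized by its frequency,
    then draw one value per row from the slice of that row's category."""
    rng = random.Random(seed)
    bounds, start = {}, 0.0
    for category, count in Counter(labels).items():
        end = start + count / len(labels)
        bounds[category] = (start, end)
        start = end
    return [rng.uniform(*bounds[label]) for label in labels], bounds


def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std


# Toy columns, invented for illustration only.
ages = [25, 32, None, 41, 38, 29, 35, 300]
workclass = ["Private", "Private", "Gov", "Private", "Self-emp"]

imputed = impute_mean(ages)                # missing
filtered = drop_iqr_outliers(imputed)      # outlier
codes, bounds = uniform_encode(workclass)  # encoder
scaled, mean, std = standardize(filtered)  # scaler
```

Returning `mean`, `std`, and `bounds` alongside the transformed values matters because the Postprocessor needs those fitted parameters to reverse each step.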
Synthesizer (Synthetic Data Generation Module)
- default: Experiment name, can be freely named
  - method: Synthesis method
    - Value: default
    - Description: Uses the default synthesis method, which is SDV Gaussian Copula
    - Gaussian Copula is a statistics-based synthesis method that captures correlations between variables
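To give a feel for how a Gaussian copula captures correlations, here is a two-column sketch using only the standard library. It is not SDV's implementation: it uses empirical (rank-based) marginals where SDV fits parametric distributions, it handles only two columns, and every function name and data value below is made up for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(column):
    """Map each value to a standard-normal score via its rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_column, u):
    """Look up the value at probability u in the sorted original column."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_gaussian_copula(col_a, col_b, n_rows, seed=0):
    """Draw synthetic (a, b) pairs whose dependence follows the fitted copula."""
    rng = random.Random(seed)
    # 1. Fit: correlation of the columns after mapping them to normal scores.
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_rows):
        # 2. Sample correlated standard normals (2x2 Cholesky factor).
        z_a = rng.gauss(0.0, 1.0)
        z_b = rho * z_a + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        # 3. Map each normal draw back to the original scale of its column.
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z_a)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z_b)),
        ))
    return rows


# Toy columns, invented for illustration only.
education_years = [8, 10, 12, 12, 14, 16, 18, 20]
income = [18, 22, 30, 28, 41, 55, 62, 80]

for row in sample_gaussian_copula(education_years, income, n_rows=5):
    print(row)
```

The key idea is the separation of concerns: each column's own distribution is handled by the quantile mapping, while the relationship between columns is carried entirely by the correlation of the normal scores.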
Postprocessor (Data Postprocessing Module)
- default: Experiment name, can be freely named
  - method: Postprocessing method
    - Value: default
    - Description: Automatically performs reverse operations of Preprocessor to restore synthetic data to original format
    - Restoration sequence (reverse of preprocessing):
      - inverse scaler: Reverse scaling, restoring standardized values
      - inverse encoder: Reverse encoding, restoring encoded categorical variables
      - restore missing: Reinsert missing values according to original proportion
    - Note: Outlier handling cannot be reversed
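The restoration sequence can be sketched in the same spirit. The functions below are hypothetical stand-ins, not the library's API. They assume the scaler stored a mean and standard deviation, that the uniform encoder stored the upper bound of each category's interval, and that missing values are reinserted at random positions.

```python
import random
from bisect import bisect_right


def inverse_standardize(scaled, mean, std):
    """Undo standardization: x = z * std + mean."""
    return [z * std + mean for z in scaled]


def inverse_uniform_encode(codes, categories, upper_bounds):
    """Map each code in [0, 1) back to the category whose interval contains it."""
    last = len(categories) - 1
    return [categories[min(bisect_right(upper_bounds, c), last)] for c in codes]


def reinsert_missing(values, missing_rate, seed=0):
    """Blank out random positions so the share of None matches missing_rate."""
    rng = random.Random(seed)
    n_missing = round(len(values) * missing_rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


# Toy synthetic output and fitted parameters, invented for illustration only.
ages = inverse_standardize([-1.0, 0.0, 1.0, 2.0], mean=40.0, std=10.0)
workclass = inverse_uniform_encode(
    [0.30, 0.75, 0.95, 0.10],
    categories=["Private", "Self-emp", "Gov"],
    upper_bounds=[0.6, 0.9, 1.0],  # Private: [0, 0.6), Self-emp: [0.6, 0.9), Gov: [0.9, 1.0)
)
ages_with_gaps = reinsert_missing(ages, missing_rate=0.25)
```

There is no inverse for the outlier step because removed values leave nothing behind to restore, which is why the note above singles it out.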
Reporter (Output Module)
- output: Experiment name, can be freely named
  - method: Report method
    - Value: save_data
    - Description: Saves the output data from the specified module as a CSV file
  - source: Data source module
    - Value: Synthesizer
    - Description: Saves synthetic data generated from the Synthesizer module
    - Can also choose other modules, such as Preprocessor, Postprocessor, etc.
    - Default output file naming format: petsard_Synthesizer[output].csv
Execution Flow
- Loader loads the adult-income.csv data
- Preprocessor performs preprocessing (impute missing values, handle outliers, encode, scale)
- Synthesizer generates synthetic data using the Gaussian Copula method
- Postprocessor restores synthetic data to original format (reverse scaling, reverse encoding, insert missing values)
- Reporter saves the final synthetic data as a CSV file
Advanced Usage
For custom parameters, please refer to the following documentation: