Data Synthesis with Default Parameters
This is the simplest way to generate privacy-enhanced synthetic data. The default synthesis method currently uses the built-in Gaussian Copula implementation.
Click the button below to run this example in Colab:
Note: If using Colab, please see the runtime setup guide.
Loader:
  load_csv:
    filepath: benchmark/adult-income.csv
Preprocessor:
  default:
    method: 'default'
Synthesizer:
  default:
    method: 'default'
Postprocessor:
  default:
    method: 'default'
Reporter:
  output:
    method: 'save_data'
    source: 'Synthesizer'
YAML Parameters Detailed Explanation
Loader (Data Loading Module)
- load_csv: Experiment name, can be freely named; descriptive names are recommended
  - filepath: Data file path
    - Value: benchmark/adult-income.csv
    - Description: Specifies the location of the data file to load. This example uses adult-income.csv, which you can replace with your own CSV file path
    - Supported formats: CSV, TSV, Excel (requires openpyxl), OpenDocument
    - Supports relative or absolute paths
    - Also supports the benchmark:// protocol for automatic standard dataset downloads
Recommended: Use Schema:
To ensure data loading accuracy and consistency, it is highly recommended to use the schema parameter to pre-define the data structure. Schema allows you to explicitly specify the data type of each column (numeric, categorical, datetime, etc.), constraints, and relationships between columns.
Example with schema:
Loader:
  load_csv:
    filepath: benchmark/adult-income.csv
    schema: benchmark/adult-income_schema.yaml
For detailed information about Schema, please refer to the Schema YAML Documentation.
Preprocessor (Data Preprocessing Module)
- default: Experiment name, can be freely named
  - method: Preprocessing method
    - Value: default
    - Description: Uses the default processing sequence, including the following steps:
      - missing (missing value handling): Numeric columns use mean imputation; for categorical columns, rows with missing values are dropped
      - outlier (outlier handling): Numeric columns use the IQR method
      - encoder (encoding): Categorical columns use uniform encoding (encoder_uniform)
      - scaler (scaling): Numeric columns use standardization (scaler_standard)
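The default sequence above can be sketched on a single column with the Python standard library. This is an illustration of the named techniques only, not PETsARD's own code: the helper names, the 1.5 IQR multiplier, and the frequency-proportional intervals used for uniform encoding are assumptions made for this example.

```python
import random
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def filter_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std


def uniform_encode(labels, rng):
    """Give each category a slice of [0, 1] sized by its frequency,
    then draw a uniform value inside that slice for every record."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    intervals, start = {}, 0.0
    for label in sorted(counts):
        width = counts[label] / len(labels)
        intervals[label] = (start, start + width)
        start += width
    codes = [rng.uniform(*intervals[label]) for label in labels]
    return codes, intervals


# Numeric column: missing -> outlier -> scaler, in that order.
ages = [25, 31, None, 47, 52, 38, 29, 400]  # 400 is an obvious outlier
ages = impute_mean(ages)
ages = filter_iqr(ages)
scaled, mean, std = standardize(ages)

# Categorical column: encoder.
rng = random.Random(0)
codes, intervals = uniform_encode(["a", "b", "a", "a"], rng)
```

The mean and standard deviation returned by standardize, and the intervals returned by uniform_encode, are the pieces a later restoration step would need in order to reverse the transformation.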
Synthesizer (Synthetic Data Generation Module)
- default: Experiment name, can be freely named
  - method: Synthesis method
    - Value: default
    - Description: Uses the default synthesis method, which is PETsARD's built-in Gaussian Copula
      - Gaussian Copula is a statistics-based synthesis method that captures correlations between variables
      - This is a built-in implementation; no external dependencies are required
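To make the idea of a Gaussian copula concrete, here is a minimal two-column sketch using only the standard library. It is not PETsARD's implementation: the function names, the rank-based normal scores, and the empirical-quantile back-transform are choices made for this illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def to_normal_scores(column):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def from_normal_score(z, sorted_column):
    """Map a normal score back to the data scale via the empirical quantile."""
    u = STD_NORMAL.cdf(z)
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def sample_gaussian_copula(x, y, n_samples, rng):
    """Fit a two-column Gaussian copula and draw synthetic (x, y) pairs."""
    rho = pearson(to_normal_scores(x), to_normal_scores(y))
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        # Second normal correlated with the first (2x2 Cholesky factor).
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho**2)) * rng.gauss(0, 1)
        samples.append((from_normal_score(z1, xs), from_normal_score(z2, ys)))
    return samples


rng = random.Random(42)
age = [23, 35, 41, 29, 52, 47, 38, 60]
hours = [20, 40, 45, 35, 50, 42, 38, 55]
synthetic = sample_gaussian_copula(age, hours, 200, rng)
```

Each column keeps its own marginal distribution (synthetic values are drawn from the observed values), while the dependence between columns is carried by the single correlation parameter rho in the normal-score space.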
Postprocessor (Data Postprocessing Module)
- default: Experiment name, can be freely named
  - method: Postprocessing method
    - Value: default
    - Description: Automatically performs the reverse operations of the Preprocessor to restore synthetic data to its original format
    - Restoration sequence (reverse of preprocessing):
      - inverse scaler: Reverse scaling, restoring standardized values
      - inverse encoder: Reverse encoding, restoring encoded categorical variables
      - restore missing: Reinsert missing values according to the original proportion
    - Note: Outlier handling cannot be reversed
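The restoration sequence above can be sketched as three small inverse functions. This is a hedged illustration rather than PETsARD's code: the helper names are hypothetical, and it assumes the scaler stored a mean and standard deviation, the encoder stored one [low, high) interval per category, and the missing handler stored the original missing rate.

```python
import random


def inverse_standardize(scaled, mean, std):
    """Undo (v - mean) / std using the statistics saved at fit time."""
    return [v * std + mean for v in scaled]


def inverse_uniform_encode(codes, intervals):
    """Map each value in [0, 1] back to the category whose interval holds it."""
    ordered = sorted(intervals.items(), key=lambda item: item[1][0])
    labels = []
    for code in codes:
        clipped = min(max(code, 0.0), 1.0)
        label = ordered[-1][0]  # default covers clipped == 1.0
        for name, (low, high) in ordered:
            if low <= clipped < high:
                label = name
                break
        labels.append(label)
    return labels


def reinsert_missing(values, missing_rate, rng):
    """Blank out roughly `missing_rate` of the values at random positions."""
    n_missing = round(len(values) * missing_rate)
    positions = set(rng.sample(range(len(values)), n_missing))
    return [None if i in positions else v for i, v in enumerate(values)]


rng = random.Random(0)
ages = inverse_standardize([-1.0, 0.0, 1.0], mean=40.0, std=10.0)
labels = inverse_uniform_encode(
    [0.1, 0.8, 1.0], {"a": (0.0, 0.75), "b": (0.75, 1.0)}
)
with_gaps = reinsert_missing(list(range(10)), missing_rate=0.2, rng=rng)
```

There is no counterpart for outlier handling in this sketch because removed records leave nothing to restore, which matches the note above.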
Reporter (Output Module)
- output: Experiment name, can be freely named
  - method: Report method
    - Value: save_data
    - Description: Saves the output data from the specified module as a CSV file
  - source: Data source module
    - Value: Synthesizer
    - Description: Saves synthetic data generated from the Synthesizer module
    - Can also choose other modules, such as Preprocessor, Postprocessor, etc.
    - Default output file naming format: petsard_Synthesizer[output].csv
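As a rough picture of what saving data under that naming format involves, the sketch below writes rows to a CSV file whose name follows the pattern shown above. The helpers output_filename and save_rows are hypothetical and not part of PETsARD; this page does not say which part of the configuration fills the bracketed slot, so the tag argument is only a placeholder.

```python
import csv
import os
import tempfile


def output_filename(source, tag, prefix="petsard"):
    """Build a name in the documented style, e.g. petsard_Synthesizer[output].csv."""
    return f"{prefix}_{source}[{tag}].csv"


def save_rows(rows, header, directory, source, tag):
    """Write rows to a CSV file named after the source module and tag."""
    path = os.path.join(directory, output_filename(source, tag))
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


# Write one illustrative record into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_rows(
        [[39, "Private"]], ["age", "workclass"], tmp, "Synthesizer", "output"
    )
    saved_name = os.path.basename(saved)
```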
Execution Flow
- Loader loads the adult-income.csv data
- Preprocessor performs preprocessing (impute missing values, handle outliers, encode, scale)
- Synthesizer generates synthetic data using the Gaussian Copula method
- Postprocessor restores synthetic data to original format (reverse scaling, reverse encoding, insert missing values)
- Reporter saves the final synthetic data as a CSV file
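The flow above follows the configuration's layout of module, then experiment name, then parameters. The sketch below mirrors this page's YAML as a plain Python dictionary and lists the steps in that order. It does not call PETsARD; MODULE_ORDER and plan_steps are names invented for this illustration.

```python
# The example configuration from this page, written as a Python dict.
CONFIG = {
    "Loader": {"load_csv": {"filepath": "benchmark/adult-income.csv"}},
    "Preprocessor": {"default": {"method": "default"}},
    "Synthesizer": {"default": {"method": "default"}},
    "Postprocessor": {"default": {"method": "default"}},
    "Reporter": {"output": {"method": "save_data", "source": "Synthesizer"}},
}

# Order taken from the execution flow described above.
MODULE_ORDER = ["Loader", "Preprocessor", "Synthesizer", "Postprocessor", "Reporter"]


def plan_steps(config):
    """List (module, experiment name, parameters) in execution order."""
    steps = []
    for module in MODULE_ORDER:
        for experiment, params in config.get(module, {}).items():
            steps.append((module, experiment, params))
    return steps


for module, experiment, params in plan_steps(CONFIG):
    print(f"{module}[{experiment}]: {params}")
```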
Advanced Usage
For custom parameters, please refer to the following documentation: