Evaluation of External Synthetic Data with Default Parameters
This example evaluates externally synthesized data with the default evaluation settings, enabling users to assess synthetic data produced by external solutions.
```yaml
Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark/adult-income_ori.csv
      control: benchmark/adult-income_control.csv
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark/adult-income_syn.csv
Evaluator:
  validity_check:
    method: sdmetrics-diagnosticreport
  fidelity_check:
    method: sdmetrics-qualityreport
  singling_out_risk:
    method: anonymeter-singlingout
  linkability_risk:
    method: anonymeter-linkability
    aux_cols:
      -
        - workclass
        - education
        - occupation
        - race
        - gender
      -
        - age
        - marital-status
        - relationship
        - native-country
        - income
  inference_risk:
    method: 'anonymeter-inference'
    secret: 'income'
  classification_utility:
    method: mlutility
    task_type: classification
    target: income
Reporter:
  data:
    method: 'save_data'
    source: 'Postprocessor'
  rpt:
    method: 'save_report'
    granularity:
      - 'global'
      - 'columnwise'
      - 'pairwise'
      - 'details'
```
YAML Parameters Detailed Explanation
Splitter (Data Splitting Module) - Using External Pre-split Data
This example uses the `custom_data` method to load externally pre-split datasets, which differs from the automatic splitting used in Default Synthesis and Evaluation.

- `external_split`: Experiment name; can be freely named
- `method`: Data splitting method
  - Value: `custom_data`
  - Description: Loads externally provided pre-split datasets instead of splitting automatically
  - Use case: when you already have pre-split training and testing sets
- `filepath`: Data file paths
  - `ori`: Training set (original data) path
    - Value: `benchmark/adult-income_ori.csv`
    - Description: The dataset used to train the synthesis model
  - `control`: Testing set (control data) path
    - Value: `benchmark/adult-income_control.csv`
    - Description: An independent testing set used for privacy risk evaluation
Important Notes:
- Training and testing sets must be completely independent with no overlapping rows
- Recommended split ratio: 80% training set, 20% testing set
- Testing set should not be used in the synthetic data generation process
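The pre-split files can be produced with any tool. As a minimal stdlib-only sketch of the recommended 80/20 split (the input file name `adult-income.csv` and the helper below are illustrative assumptions, not part of the framework):

```python
import csv
import random

def split_csv(src, ori_path, control_path, train_frac=0.8, seed=42):
    """Split src into disjoint training (ori) and testing (control) CSVs."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(body)                  # shuffle so both parts share one distribution
    cut = int(len(body) * train_frac)  # 80% training, 20% testing
    for path, part in [(ori_path, body[:cut]), (control_path, body[cut:])]:
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)
            w.writerows(part)
```

Because each source row is written to exactly one output file, the two sets are independent by construction, satisfying the no-overlap requirement above.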
Synthesizer (Synthetic Data Loading Module) - Using External Synthetic Data
This example uses the `custom_data` method to load synthetic data generated by external tools, which is the main difference from Default Synthesis and Evaluation.

- `external_data`: Experiment name; can be freely named
- `method`: Synthesis method
  - Value: `custom_data`
  - Description: Loads synthetic data generated by external tools (such as SDV, CTGAN, etc.); this method does not perform synthesis, it only loads existing synthetic data for evaluation
- `filepath`: Synthetic data file path
  - Value: `benchmark/adult-income_syn.csv`
  - Description: Location of the synthetic data file generated by the external tool
Important Notes:
- Synthetic data must be generated based only on the training set (`ori`)
- Information from the testing set (`control`) must not be used to generate the synthetic data
- This ensures the accuracy of the privacy evaluation
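A quick sanity check before running the evaluation is to confirm that no testing-set rows appear verbatim in the synthetic data. The stdlib-only helper below is an illustrative sketch (exact-row matching is a coarse signal of leakage, not proof of independence):

```python
import csv

def exact_row_overlap(syn_path, control_path):
    """Count synthetic rows that exactly match a control (testing-set) row."""
    def rows(path):
        with open(path, newline="") as f:
            r = csv.reader(f)
            next(r)  # skip header row
            return {tuple(row) for row in r}
    return len(rows(syn_path) & rows(control_path))
```

A non-zero count suggests the control data may have influenced synthesis, which would bias the privacy evaluation described above.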
Evaluator, Reporter
For parameter descriptions of these modules, please refer to Default Synthesis and Evaluation.
Why No Loader, Preprocessor, Postprocessor?
In the external synthesis evaluation scenario:
- No Loader needed: data loading is handled by the Splitter's `custom_data` method
- No Preprocessor needed: preprocessing should be completed in the external synthesis tool
- No Postprocessor needed: the synthetic data should already be in its final format
Execution Flow
- Splitter loads the pre-split data:
  - Training set: `adult-income_ori.csv`
  - Testing set: `adult-income_control.csv`
- Synthesizer loads the synthetic data generated by the external tool: `adult-income_syn.csv`
- Evaluator performs multiple evaluations:
  - Data validity diagnosis
  - Privacy risk assessment (singling out, linkability, inference)
  - Data fidelity assessment
  - Machine learning utility assessment
- Reporter saves the synthetic data and multi-level evaluation reports
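The fidelity assessment in this flow is delegated to SDMetrics. As a toy illustration only (not the actual `sdmetrics-qualityreport` computation), fidelity for a single categorical column can be scored as one minus the total variation distance between the original and synthetic value distributions:

```python
from collections import Counter

def column_fidelity(ori_values, syn_values):
    """1 - total variation distance between two categorical distributions.

    Returns 1.0 when the value distributions are identical and 0.0 when
    they share no probability mass at all.
    """
    def dist(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    p, q = dist(ori_values), dist(syn_values)
    tvd = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
    return 1.0 - tvd
```

For example, an original column split 50/50 between two values scores 0.75 against a synthetic column split 75/25, since a quarter of the probability mass has shifted.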
External Data Preparation Overview
Pre-synthesized data evaluation requires attention to three key components:
- Training Set - used for synthetic data generation
- Testing Set - for privacy risk evaluation
- Synthetic Data - based only on the training set
Note: Using both training and testing data for synthesis would affect the accuracy of privacy evaluation.
External Data Requirements
Splitter:
- `method: 'custom_data'`: for pre-split datasets provided externally
- `filepath`: points to the original (`ori`) and control (`control`) datasets
- Recommended ratio: 80% training, 20% testing, unless there are specific reasons otherwise

Synthesizer:
- `method: 'custom_data'`: for externally generated synthetic data
- `filepath`: points to the pre-synthesized dataset
- The data must be generated using only the training portion of the data

Evaluator:
- Ensures fair comparison between different synthetic data solutions