Evaluation of External Synthetic Data with Default Parameters
This example evaluates externally synthesized data with the default evaluation settings, enabling users to assess synthetic data produced by external solutions.
```yaml
Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark/adult-income_ori.csv
      control: benchmark/adult-income_control.csv
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark/adult-income_syn.csv
Evaluator:
  validity_check:
    method: sdmetrics-diagnosticreport
  fidelity_check:
    method: sdmetrics-qualityreport
  singling_out_risk:
    method: anonymeter-singlingout
  linkability_risk:
    method: anonymeter-linkability
    aux_cols:
      -
        - workclass
        - education
        - occupation
        - race
        - gender
      -
        - age
        - marital-status
        - relationship
        - native-country
        - income
  inference_risk:
    method: 'anonymeter-inference'
    secret: 'income'
  classification_utility:
    method: mlutility
    task_type: classification
    target: income
Reporter:
  data:
    method: 'save_data'
    source: 'Postprocessor'
  rpt:
    method: 'save_report'
    granularity:
      - 'global'
      - 'columnwise'
      - 'pairwise'
      - 'details'
```
YAML Parameters Detailed Explanation
Splitter (Data Splitting Module) - Using External Pre-split Data
This example uses the `custom_data` method to load externally pre-split datasets, which differs from the automatic splitting used in Default Synthesis and Evaluation.

- `external_split`: Experiment name; can be freely named
- `method`: Data splitting method
  - Value: `custom_data`
  - Description: Loads externally provided pre-split datasets instead of splitting automatically
  - Use case: when you already have pre-split training and testing sets
- `filepath`: Data file paths
  - `ori`: Training set (original data) path
    - Value: `benchmark/adult-income_ori.csv`
    - Description: The dataset used to train the synthesis model
  - `control`: Testing set (control data) path
    - Value: `benchmark/adult-income_control.csv`
    - Description: An independent testing set used for privacy risk evaluation
Important Notes:
- Training and testing sets must be completely independent with no overlapping rows
- Recommended split ratio: 80% training set, 20% testing set
- Testing set should not be used in the synthetic data generation process
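The pre-split files can be produced with any tool. As a minimal stdlib-only sketch of the recommended 80/20 split (the input file name `adult-income.csv` and the helper below are illustrative assumptions, not part of the framework):

```python
import csv
import random

def split_csv(src, ori_path, control_path, train_frac=0.8, seed=42):
    """Split src into disjoint training (ori) and testing (control) CSVs."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(body)                  # shuffle so both parts share one distribution
    cut = int(len(body) * train_frac)  # 80% training, 20% testing
    for path, part in [(ori_path, body[:cut]), (control_path, body[cut:])]:
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)
            w.writerows(part)
```

Because each source row is written to exactly one output file, the two sets are independent by construction, satisfying the no-overlap requirement above.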
Synthesizer (Synthetic Data Loading Module) - Using External Synthetic Data
This example uses the `custom_data` method to load synthetic data generated by external tools, which is the main difference from Default Synthesis and Evaluation.

- `external_data`: Experiment name; can be freely named
- `method`: Synthesis method
  - Value: `custom_data`
  - Description: Loads synthetic data generated by external tools (such as SDV, CTGAN, etc.); this method does not perform synthesis, it only loads existing synthetic data for evaluation
- `filepath`: Synthetic data file path
  - Value: `benchmark/adult-income_syn.csv`
  - Description: Location of the synthetic data file generated by the external tool
Important Notes:
- Synthetic data must be generated based only on the training set (`ori`)
- Information from the testing set (`control`) must not be used to generate the synthetic data
- This ensures the accuracy of the privacy evaluation
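A quick sanity check before running the evaluation is to confirm that no testing-set rows appear verbatim in the synthetic data. The stdlib-only helper below is an illustrative sketch (exact-row matching is a coarse signal of leakage, not proof of independence):

```python
import csv

def exact_row_overlap(syn_path, control_path):
    """Count synthetic rows that exactly match a control (testing-set) row."""
    def rows(path):
        with open(path, newline="") as f:
            r = csv.reader(f)
            next(r)  # skip header row
            return {tuple(row) for row in r}
    return len(rows(syn_path) & rows(control_path))
```

A non-zero count suggests the control data may have influenced synthesis, which would bias the privacy evaluation described above.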
Evaluator, Reporter
For parameter descriptions of these modules, please refer to Default Synthesis and Evaluation.
Why No Loader, Preprocessor, Postprocessor?
In the external synthesis evaluation scenario:
- No Loader needed: data loading is handled by the Splitter's `custom_data` method
- No Preprocessor needed: preprocessing should be completed in the external synthesis tool
- No Postprocessor needed: the synthetic data should already be in its final format
Execution Flow
- Splitter loads the pre-split data:
  - Training set: `adult-income_ori.csv`
  - Testing set: `adult-income_control.csv`
- Synthesizer loads the synthetic data generated by the external tool: `adult-income_syn.csv`
- Evaluator performs multiple evaluations:
  - Data validity diagnosis
  - Privacy risk assessment (singling out, linkability, inference)
  - Data fidelity assessment
  - Machine learning utility assessment
- Reporter saves the synthetic data and multi-level evaluation reports
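The fidelity assessment in this flow is delegated to SDMetrics. As a toy illustration only (not the actual `sdmetrics-qualityreport` computation), fidelity for a single categorical column can be scored as one minus the total variation distance between the original and synthetic value distributions:

```python
from collections import Counter

def column_fidelity(ori_values, syn_values):
    """1 - total variation distance between two categorical distributions.

    Returns 1.0 when the value distributions are identical and 0.0 when
    they share no probability mass at all.
    """
    def dist(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    p, q = dist(ori_values), dist(syn_values)
    tvd = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
    return 1.0 - tvd
```

For example, an original column split 50/50 between two values scores 0.75 against a synthetic column split 75/25, since a quarter of the probability mass has shifted.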
External Data Preparation Overview
Pre-synthesized data evaluation requires attention to three key components:
- Training Set - used for synthetic data generation
- Testing Set - for privacy risk evaluation
- Synthetic Data - based only on the training set
Note: Using both training and testing data for synthesis would affect the accuracy of privacy evaluation.
External Data Requirements
Splitter:
- `method: 'custom_data'`: for pre-split datasets provided externally
- `filepath`: points to the original (`ori`) and control (`control`) datasets
- Recommended ratio: 80% training, 20% testing, unless there are specific reasons otherwise

Synthesizer:
- `method: 'custom_data'`: for externally generated synthetic data
- `filepath`: points to the pre-synthesized dataset
- The data must be generated using only the training portion of the data

Evaluator:
- Ensures fair comparison between different synthetic data solutions