Evaluation of External Synthetic Data with Default Parameters

External synthesis with default evaluation: evaluate synthetic data generated by external tools using the default evaluation settings, so that synthetic data from different external solutions can be assessed and compared.

Click the button below to run this example in Colab:

Open In Colab

Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark/adult-income_ori.csv
      control: benchmark/adult-income_control.csv
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark/adult-income_syn.csv
Evaluator:
  validity_check:
    method: sdmetrics-diagnosticreport
  fidelity_check:
    method: sdmetrics-qualityreport
  singling_out_risk:
    method: anonymeter-singlingout
  linkability_risk:
    method: anonymeter-linkability
    aux_cols:
      -
        - workclass
        - education
        - occupation
        - race
        - gender
      -
        - age
        - marital-status
        - relationship
        - native-country
        - income
  inference_risk:
    method: anonymeter-inference
    secret: income
  classification_utility:
    method: mlutility
    task_type: classification
    target: income
Reporter:
  data:
    method: 'save_data'
    source: 'Postprocessor'
  rpt:
    method: 'save_report'
    granularity:
      - 'global'
      - 'columnwise'
      - 'pairwise'
      - 'details'
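Before handing a configuration like the one above to the pipeline, it can be parsed and sanity-checked. A minimal sketch using PyYAML; the `validate_config` helper and its list of required modules are illustrative assumptions, not part of any toolkit:

```python
import yaml  # PyYAML

# Abbreviated version of the example configuration above.
CONFIG = """
Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark/adult-income_ori.csv
      control: benchmark/adult-income_control.csv
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark/adult-income_syn.csv
Evaluator:
  validity_check:
    method: sdmetrics-diagnosticreport
Reporter:
  rpt:
    method: save_report
    granularity: [global, columnwise, pairwise, details]
"""

def validate_config(text: str) -> dict:
    """Parse the YAML and confirm each top-level module is present."""
    cfg = yaml.safe_load(text)
    required = ["Splitter", "Synthesizer", "Evaluator", "Reporter"]
    missing = [m for m in required if m not in cfg]
    if missing:
        raise ValueError(f"missing modules: {missing}")
    return cfg

cfg = validate_config(CONFIG)
print(cfg["Splitter"]["external_split"]["filepath"]["ori"])
# → benchmark/adult-income_ori.csv
```

Parsing the file once up front surfaces indentation or key-name mistakes before any evaluation work begins.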

YAML Parameters Detailed Explanation

Splitter (Data Splitting Module) - Using External Pre-split Data

This example uses the custom_data method to load externally pre-split datasets, which differs from the automatic splitting used in Default Synthesis and Evaluation.

  • external_split: Experiment name; it can be named freely
  • method: Data splitting method
    • Value: custom_data
    • Description: Used to load externally provided pre-split datasets instead of automatic splitting
    • Use case: When you already have pre-split training and testing sets
  • filepath: Data file paths
    • ori: Training set (original data) path
      • Value: benchmark/adult-income_ori.csv
      • Description: Dataset used for training the synthesis model
    • control: Testing set (control data) path
      • Value: benchmark/adult-income_control.csv
      • Description: Independent testing set for privacy risk evaluation

Important Notes:

  • Training and testing sets must be completely independent with no overlapping rows
  • Recommended split ratio: 80% training set, 20% testing set
  • Testing set should not be used in the synthetic data generation process
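The no-overlap requirement can be verified mechanically before evaluation. A minimal sketch with pandas; the tiny inline frames stand in for the two benchmark CSVs, and `count_overlapping_rows` is an illustrative helper:

```python
import pandas as pd

def count_overlapping_rows(ori: pd.DataFrame, control: pd.DataFrame) -> int:
    """Count rows of the control (testing) set that also appear in the training set."""
    # An inner merge on all shared columns keeps only rows present in both frames.
    merged = control.merge(ori.drop_duplicates(), how="inner")
    return len(merged)

# Tiny stand-in frames; in practice, load the two benchmark CSVs instead.
ori = pd.DataFrame({"age": [25, 38, 52], "income": ["<=50K", ">50K", "<=50K"]})
control = pd.DataFrame({"age": [41, 38], "income": [">50K", ">50K"]})

overlap = count_overlapping_rows(ori, control)
assert overlap == 1  # the (38, ">50K") row appears in both sets: not independent
```

A nonzero count means the split is not truly independent and the privacy metrics would be biased.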

Synthesizer (Synthetic Data Loading Module) - Using External Synthetic Data

This example uses the custom_data method to load synthetic data generated by external tools, which is the main difference from Default Synthesis and Evaluation.

  • external_data: Experiment name; it can be named freely
  • method: Synthesis method
    • Value: custom_data
    • Description: Used to load synthetic data generated by external tools (e.g., SDV, CTGAN)
    • This method does not perform synthesis; it only loads existing synthetic data for evaluation
  • filepath: Synthetic data file path
    • Value: benchmark/adult-income_syn.csv
    • Description: Location of the synthetic data file generated by external tools

Important Notes:

  • Synthetic data must be generated based only on the training set (ori)
  • Should not use information from the testing set (control) to generate synthetic data
  • This ensures the accuracy of privacy evaluation
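Before running the evaluators, it also helps to confirm that the external synthetic file matches the training set's schema, since mismatched columns or dtypes are a common cause of evaluation failures. A minimal sketch with pandas; `check_schema_match` is an illustrative helper, not part of any library:

```python
import pandas as pd

def check_schema_match(ori: pd.DataFrame, syn: pd.DataFrame) -> list[str]:
    """Return a list of schema problems; an empty list means the schemas align."""
    problems = []
    if list(ori.columns) != list(syn.columns):
        problems.append(f"column mismatch: {list(ori.columns)} vs {list(syn.columns)}")
    for col in set(ori.columns) & set(syn.columns):
        if ori[col].dtype != syn[col].dtype:
            problems.append(f"dtype mismatch in column {col!r}")
    return problems

# Stand-in frames; in practice, load adult-income_ori.csv and adult-income_syn.csv.
ori = pd.DataFrame({"age": [25, 38], "income": ["<=50K", ">50K"]})
syn = pd.DataFrame({"age": [30, 44], "income": [">50K", "<=50K"]})
assert check_schema_match(ori, syn) == []  # same columns, same dtypes
```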

Evaluator, Reporter

For parameter descriptions of these modules, please refer to the Default Synthesis and Evaluation.

Why No Loader, Preprocessor, Postprocessor?

In the external synthesis evaluation scenario:

  • No Loader needed: Data loading is handled by Splitter’s custom_data method
  • No Preprocessor needed: Preprocessing should be completed in the external synthesis tool
  • No Postprocessor needed: Synthetic data should already be in final format

Execution Flow

  1. Splitter loads the pre-split data: adult-income_ori.csv (training) and adult-income_control.csv (testing)
  2. Synthesizer loads synthetic data generated by external tools: adult-income_syn.csv
  3. Evaluator performs multiple evaluations:
    • Data validity diagnosis
    • Privacy risk assessment (singling out, linkability, inference)
    • Data fidelity assessment
    • Machine learning utility assessment
  4. Reporter saves synthetic data and multi-level evaluation reports

External Data Preparation Overview

Pre-synthesized data evaluation requires attention to three key components:

  1. Training Set - used for synthetic data generation
  2. Testing Set - for privacy risk evaluation
  3. Synthetic Data - based only on the training set

Note: Using both training and testing data for synthesis would affect the accuracy of privacy evaluation.
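If the split has not been prepared yet, the recommended 80/20 split can be produced in a few lines of pandas. A sketch assuming the full dataset fits in memory; the commented-out file paths mirror the example configuration:

```python
import pandas as pd

def split_80_20(df: pd.DataFrame, seed: int = 42):
    """Sample 80% as ori (training) and keep the remaining 20% as control (testing)."""
    ori = df.sample(frac=0.8, random_state=seed)
    control = df.drop(ori.index)  # disjoint from ori by construction
    return ori.reset_index(drop=True), control.reset_index(drop=True)

# Stand-in dataset; in practice, load the full source CSV instead.
df = pd.DataFrame({"age": range(100), "income": ["<=50K", ">50K"] * 50})
ori, control = split_80_20(df)
assert len(ori) == 80 and len(control) == 20

# ori.to_csv("benchmark/adult-income_ori.csv", index=False)          # illustrative paths
# control.to_csv("benchmark/adult-income_control.csv", index=False)
```

Because `control` is built by dropping `ori`'s row indices, the two sets cannot share a row, which satisfies the independence requirement above.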

External Data Requirements

  1. Splitter:
    • method: 'custom_data': For pre-split datasets provided externally
    • filepath: Points to the original (ori) and control (control) datasets
    • Recommended ratio: 80% training, 20% testing, unless there is a specific reason to deviate
  2. Synthesizer:
    • method: 'custom_data': For externally generated synthetic data
    • filepath: Points to the pre-synthesized dataset
    • Must be generated using only the training portion of the data
  3. Evaluator:
    • Ensures fair comparison between different synthetic data solutions