Evaluation Interpretation: Purpose-Driven Assessment
After data preparation is complete, evaluating the quality of the synthetic data is a critical step in confirming that it meets application requirements. The evaluation strategy should follow from the intended use of the synthetic data, because different application scenarios call for different evaluation focuses and standards. This chapter helps you select appropriate evaluation methods and parameter settings based on how the data will be used.
Quality assessment of synthetic data encompasses three core aspects:
- Privacy Protection: Ensuring synthetic data does not leak personally identifiable information from the original data
- Data Fidelity: Measuring the similarity of synthetic data to original data in statistical properties
- Data Utility: Verifying the performance of synthetic data in specific machine learning tasks
Across these three aspects, our team recommends always prioritizing privacy protection first, then weighting the other two according to the application scenario:
- Data Release Scenarios: When synthetic data will be publicly released or shared with third parties, pursue high fidelity to maintain data versatility
- Specific Task Modeling: When synthetic data is used for specific machine learning tasks (such as data augmentation, model training), pursue high utility to meet task requirements
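The prioritization described here, together with the diagnostic check that appears as Step 1 in this chapter's flowchart, can be sketched as a small decision function. This is only an illustrative sketch: the names `Purpose`, `EvaluationPlan`, and `plan_evaluation` are hypothetical and are not part of any tool's API, and the pass/fail inputs are assumed to come from whatever diagnostic and privacy evaluations you run beforehand.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    """Hypothetical labels for the two application scenarios."""
    RELEASE = "release"  # public release or sharing with third parties
    TASK = "task"        # specific ML task (data augmentation, model training)


@dataclass
class EvaluationPlan:
    proceed: bool        # False means: fix the synthesis before evaluating further
    primary_focus: str   # "diagnostic", "privacy", "fidelity", or "utility"
    note: str


def plan_evaluation(diagnostic_passed: bool,
                    privacy_passed: bool,
                    purpose: Purpose) -> EvaluationPlan:
    """Return the evaluation focus implied by the gates and the intended use."""
    # Gate 1: structural problems must be fixed before anything else.
    if not diagnostic_passed:
        return EvaluationPlan(False, "diagnostic",
                              "Data structure issue: check the synthesis process.")
    # Gate 2: privacy protection always takes priority over fidelity and utility.
    if not privacy_passed:
        return EvaluationPlan(False, "privacy",
                              "Privacy risk too high: adjust synthesis parameters.")
    # Scenario A: data release with no specific downstream task.
    if purpose is Purpose.RELEASE:
        return EvaluationPlan(True, "fidelity",
                              "Pursue the highest fidelity to keep the data versatile.")
    # Scenario B: a specific task; fidelity only needs to meet the threshold.
    return EvaluationPlan(True, "utility",
                          "Pursue high utility for the target task.")


if __name__ == "__main__":
    print(plan_evaluation(True, True, Purpose.TASK))
```

Keeping the two gates ahead of the purpose check mirrors the recommendation that neither fidelity nor utility is worth optimizing until the data is structurally sound and privacy risk is acceptable.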
flowchart TD
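Start([Start evaluation])
Diagnostic{Step 1:<br/>Data diagnostics passed?}
DiagnosticFail[Data structure issue<br/>Check the synthesis process]
Privacy{Step 2:<br/>Privacy protection passed?}
PrivacyFail[Privacy risk too high<br/>Adjust synthesis parameters]
Purpose{Step 3:<br/>Intended use of synthetic data?}
Release[Scenario A:<br/>Data release<br/>No specific downstream task]
Task[Scenario B:<br/>Specific task application<br/>Data augmentation/model training]
FidelityFocus[Evaluation focus:<br/>Pursue highest fidelity]
UtilityFocus[Evaluation focus:<br/>Pursue high utility<br/>Fidelity only needs to meet the threshold]
Start --> Diagnostic
Diagnostic -->|No| DiagnosticFail
Diagnostic -->|Yes| Privacy
Privacy -->|No| PrivacyFail
Privacy -->|Yes| Purpose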
Purpose -->|A| Release
Purpose -->|B| Task
Release --> FidelityFocus
Task --> UtilityFocus
style Start fill:#e1f5fe
style DiagnosticFail fill:#ffcdd2
style PrivacyFail fill:#ffcdd2
style FidelityFocus fill:#c8e6c9
style UtilityFocus fill:#c8e6c9
Chapter Navigation
1. Privacy Risk Estimation: Protection Parameter Configuration
Privacy protection is the first priority in synthetic data quality assessment. This section explains how to use the Anonymeter tool to evaluate three privacy attack modes (singling out, linkability, and inference), and provides parameter configuration recommendations and standards for interpreting risk.
2. Release or Modeling: Fidelity or Utility
Select fidelity or utility as the primary evaluation aspect based on the intended use of the synthetic data. This section explains why data release scenarios should pursue high fidelity and specific task modeling should pursue high utility, and how to conduct and interpret the evaluation in each case.
3. Synthetic Data Modeling Use: Experiment Design Selection
When synthetic data is used for specific machine learning tasks, the experiment design determines how models are trained and evaluated. This section explains the differences between the domain transfer and dual model control group designs, along with their selection criteria and application scenarios.