SDV Methods

PETsARD integrates the SDV (Synthetic Data Vault) package and exposes four of its single-table synthesis methods: GaussianCopula, CTGAN, CopulaGAN, and TVAE.

⚠️ Important Notice: The built-in SDV integration is planned for deprecation. We recommend using SDV Custom Methods for better flexibility and long-term support.

ℹ️ Note: This document provides YAML configuration examples only; there are no Jupyter notebook examples.

Usage Examples

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema

Synthesizer:
  gaussian:
    method: sdv-single_table-gaussiancopula

  ctgan:
    method: sdv-single_table-ctgan

  copulagan:
    method: sdv-single_table-copulagan

  tvae:
    method: sdv-single_table-tvae

Methods Overview

| Method | method Setting | Features | GPU |
|---|---|---|---|
| GaussianCopula | default or sdv-single_table-gaussiancopula | Fast, suitable for large data | ✗ |
| CTGAN | sdv-single_table-ctgan | High quality, complex patterns | ✓ |
| CopulaGAN | sdv-single_table-copulagan | Balances statistics & deep learning | ✓ |
| TVAE | sdv-single_table-tvae | Stable training, fast convergence | ✓ |
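
Because the overview lists default as an alternative setting for GaussianCopula, a minimal configuration can leave out the SDV-specific method string. The sketch below reuses the Loader block from the usage example; the experiment name baseline is a hypothetical label chosen for illustration.

```yaml
Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema

Synthesizer:
  baseline:           # hypothetical experiment name
    method: default   # listed in the overview as equivalent to sdv-single_table-gaussiancopula
```

Writing the full method string instead keeps the configuration explicit if the meaning of default ever changes.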

Method Details

GaussianCopula

A classical method based on statistical distributions. It runs quickly, which makes it suitable for rapid prototyping.

Features:

  • ✓ Fast, suitable for large data
  • ✓ Low computational requirements
  • ✗ Primarily captures linear correlations

CTGAN

A GAN-based deep learning method. Of the four methods, it targets the highest generation quality, at the cost of longer training time.

Features:

  • ✓ High-quality synthetic data
  • ✓ Suitable for complex patterns
  • ✗ Longer training time

Default Parameters:

  • epochs: 300
  • batch_size: 500
  • generator_lr: 0.0002
  • discriminator_lr: 0.0002

CopulaGAN

Combines copula-based statistical modeling with a GAN. Suitable for mixed-type data.

Features:

  • ✓ Balances statistics & deep learning
  • ✓ Better marginal distribution simulation
  • ✓ Suitable for continuous & discrete mixed data

Default Parameters:

  • epochs: 300
  • batch_size: 500
  • default_distribution: beta

TVAE

A VAE-based generative model with a stable training process.

Features:

  • ✓ Stable training process
  • ✓ Better convergence
  • ✓ Suitable for medium-scale data

Default Parameters:

  • epochs: 300
  • batch_size: 500
  • encoder_layers: [128, 128]
  • decoder_layers: [128, 128]

Automatic Features

Schema Conversion

PETsARD automatically converts its internal Schema to SDV Metadata.

Automatic Parameters

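PETsARD automatically enables the following SDV parameters:

  • enforce_rounding (all methods): Rounds synthetic values to the same number of decimal digits as the real data, so integer columns stay integers
  • enforce_min_max_values (GaussianCopula and TVAE): Keeps synthetic values within the minimum and maximum observed in the real data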

GPU Detection

Deep learning methods (CTGAN, CopulaGAN, TVAE) automatically detect a GPU and use it when one is available.

Selection Guide

| Scenario | Recommended Method |
|---|---|
| Quick testing | GaussianCopula |
| High quality needs | CTGAN |
| Mixed-type data | CopulaGAN |
| Medium data | TVAE |
| Large data | GaussianCopula |

Available Distributions

GaussianCopula and CopulaGAN support:

  • norm: Normal distribution
  • truncnorm: Truncated normal distribution (default)
  • beta: Beta distribution
  • gamma: Gamma distribution
  • uniform: Uniform distribution
  • gaussian_kde: Kernel density estimation

Limitations

Built-in Integration Limits

  • ✗ Cannot adjust training parameters (epochs, batch_size, etc.)
  • ✗ Cannot specify distribution types
  • ✗ Cannot manually select CPU/GPU
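
These limits are the reason the notice at the top of this page recommends SDV Custom Methods. The fragment below is only a rough sketch of what such a configuration might look like. Everything in it is an assumption made for illustration and is not confirmed by this page: the custom_method value, the module_path and class_name keys, the file name my_sdv_synthesizer.py, and the class name MyCTGANSynthesizer. Check the SDV Custom Methods documentation for the actual keys.

```yaml
Synthesizer:
  ctgan_custom:                           # hypothetical experiment name
    method: custom_method                 # assumed value, not confirmed by this page
    module_path: my_sdv_synthesizer.py    # hypothetical file containing the wrapper class
    class_name: MyCTGANSynthesizer        # hypothetical class that wraps an SDV synthesizer
```

Inside such a wrapper, settings like epochs, batch_size, or the distribution type would be passed to SDV directly, which the built-in integration does not allow.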

Important Notes

  1. Deep learning methods train faster on a GPU.
  2. The deep learning methods default to 300 epochs, so training on a CPU may be time-consuming.
  3. Using deep learning methods on large datasets requires significant memory.
  4. The built-in integration uses fixed parameters that cannot be adjusted.