SDV Methods
PETsARD integrates the SDV (Synthetic Data Vault) package, providing various advanced synthetic data generation algorithms.
⚠️ Important Notice: The built-in SDV integration is planned for deprecation. We recommend using SDV Custom Methods for better flexibility and long-term support.
ℹ️ Note: This document only provides YAML configuration examples, without Jupyter notebook examples.
Usage Examples
Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Synthesizer:
  gaussian:
    method: sdv-single_table-gaussiancopula
  ctgan:
    method: sdv-single_table-ctgan
  copulagan:
    method: sdv-single_table-copulagan
  tvae:
    method: sdv-single_table-tvae
Methods Overview
Method | method Setting | Features | GPU |
---|---|---|---|
GaussianCopula | default or sdv-single_table-gaussiancopula | Fast, suitable for large data | ✗ |
CTGAN | sdv-single_table-ctgan | High quality, complex patterns | ✓ |
CopulaGAN | sdv-single_table-copulagan | Balances statistics & deep learning | ✓ |
TVAE | sdv-single_table-tvae | Stable training, fast convergence | ✓ |
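The method settings in the table share a three-part pattern. The sketch below splits a setting into those parts. It is an illustration only: the `MethodSpec` type, the `parse_method` helper, and the reading of the parts as library, data type, and model are assumptions inferred from the names above, not PETsARD's own parser, and the `default` setting is not handled.

```python
from typing import NamedTuple


class MethodSpec(NamedTuple):
    """Hypothetical container for the parts of a method setting."""
    library: str    # e.g. "sdv"
    data_type: str  # e.g. "single_table"
    model: str      # e.g. "ctgan"


def parse_method(method: str) -> MethodSpec:
    """Split a setting such as 'sdv-single_table-ctgan' into three parts."""
    parts = method.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected 'library-data_type-model', got {method!r}")
    return MethodSpec(*parts)


spec = parse_method("sdv-single_table-ctgan")
print(spec.library, spec.data_type, spec.model)
```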
Method Details
GaussianCopula
A classical method based on statistical distributions. It runs fast, which makes it suitable for quick prototyping.
Features:
- ✓ Fast, suitable for large data
- ✓ Low computational requirements
- ✗ Primarily captures linear correlations
CTGAN
A GAN-based deep learning method. Of the four methods, it aims for the highest generation quality, at the cost of longer training time.
Features:
- ✓ High-quality synthetic data
- ✓ Suitable for complex patterns
- ✗ Longer training time
Default Parameters:
- epochs: 300
- batch_size: 500
- generator_lr: 0.0002
- discriminator_lr: 0.0002
CopulaGAN
Combines copula-based statistical modeling with a GAN, suitable for mixed-type data.
Features:
- ✓ Balances statistics & deep learning
- ✓ Better marginal distribution simulation
- ✓ Suitable for continuous & discrete mixed data
Default Parameters:
- epochs: 300
- batch_size: 500
- default_distribution: beta
TVAE
A VAE-based generative model with a stable training process.
Features:
- ✓ Stable training process
- ✓ Better convergence
- ✓ Suitable for medium-scale data
Default Parameters:
- epochs: 300
- batch_size: 500
- encoder_layers: [128, 128]
- decoder_layers: [128, 128]
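The default parameters listed for CTGAN, CopulaGAN, and TVAE can be collected in one place, which is handy when moving to a custom method that does allow overrides. The following is a minimal sketch: `DOCUMENTED_DEFAULTS` and `with_overrides` are hypothetical names, the values are copied from this page, and the parameter names should be checked against the installed SDV version before being passed to a synthesizer.

```python
import copy

# Defaults as listed on this page (not read from SDV at runtime).
DOCUMENTED_DEFAULTS = {
    "ctgan": {
        "epochs": 300,
        "batch_size": 500,
        "generator_lr": 0.0002,
        "discriminator_lr": 0.0002,
    },
    "copulagan": {
        "epochs": 300,
        "batch_size": 500,
        "default_distribution": "beta",
    },
    "tvae": {
        "epochs": 300,
        "batch_size": 500,
        "encoder_layers": [128, 128],
        "decoder_layers": [128, 128],
    },
}


def with_overrides(model: str, **overrides) -> dict:
    """Return the documented defaults for `model`, updated with overrides."""
    params = copy.deepcopy(DOCUMENTED_DEFAULTS[model])
    unknown = set(overrides) - set(params)
    if unknown:
        raise KeyError(f"not a documented parameter for {model}: {sorted(unknown)}")
    params.update(overrides)
    return params


# Example: a shorter CTGAN run for a quick CPU test.
quick_ctgan = with_overrides("ctgan", epochs=50)
print(quick_ctgan)
```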
Automatic Features
Schema Conversion
PETsARD automatically converts its internal Schema to SDV Metadata.
Automatic Parameters
All methods automatically enable:
- enforce_rounding: Integer rounding
- enforce_min_max_values: Value range enforcement (GaussianCopula, TVAE)
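To make the effect of these two settings concrete, here is a small pure-Python illustration. It is not SDV's implementation, which operates on whole tables and learns the bounds and precision per column; the helper names and the sample values are made up for this sketch.

```python
def enforce_min_max(values, real_values):
    """Clip synthetic values to the range observed in the real column."""
    low, high = min(real_values), max(real_values)
    return [min(max(v, low), high) for v in values]


def enforce_rounding(values):
    """Round synthetic values to integers, matching an integer column."""
    return [round(v) for v in values]


real_ages = [17, 25, 38, 52, 90]        # hypothetical real column
raw_synthetic = [16.2, 24.7, 91.4, 40.3]  # hypothetical raw model output

clipped = enforce_min_max(raw_synthetic, real_ages)
cleaned = enforce_rounding(clipped)
print(cleaned)  # [17, 25, 90, 40]
```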
GPU Detection
Deep learning methods (CTGAN, CopulaGAN, TVAE) automatically detect and use GPU.
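A common way to perform this kind of detection is sketched below. It assumes PyTorch as the deep learning backend and falls back to CPU when PyTorch is not installed. The `pick_device` helper is hypothetical, and PETsARD's actual detection logic is not shown on this page and may differ.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable through PyTorch, else "cpu"."""
    try:
        import torch  # optional dependency; only needed for GPU detection
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Training would run on: {pick_device()}")
```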
Selection Guide
Scenario | Recommended Method |
---|---|
Quick testing | GaussianCopula |
High quality needs | CTGAN |
Mixed-type data | CopulaGAN |
Medium data | TVAE |
Large data | GaussianCopula |
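The selection guide can also be expressed as a lookup from scenario to `method` setting, for example when generating configurations programmatically. This is a sketch under the assumption that the scenario labels are used as keys; `RECOMMENDED` and `recommend` are hypothetical names.

```python
# Scenario labels from the selection guide mapped to `method` settings.
RECOMMENDED = {
    "quick testing": "sdv-single_table-gaussiancopula",
    "high quality needs": "sdv-single_table-ctgan",
    "mixed-type data": "sdv-single_table-copulagan",
    "medium data": "sdv-single_table-tvae",
    "large data": "sdv-single_table-gaussiancopula",
}


def recommend(scenario: str) -> str:
    """Look up the recommended method setting for a scenario label."""
    key = scenario.strip().lower()
    if key not in RECOMMENDED:
        raise KeyError(f"unknown scenario {scenario!r}; choose from {sorted(RECOMMENDED)}")
    return RECOMMENDED[key]


print(recommend("Mixed-type data"))
```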
Available Distributions
GaussianCopula and CopulaGAN support:
- norm: Normal distribution
- truncnorm: Truncated normal distribution (default)
- beta: Beta distribution
- gamma: Gamma distribution
- uniform: Uniform distribution
- gaussian_kde: Kernel density estimation
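The distribution name selects the marginal fitted to each numerical column. In a Gaussian copula, each value is passed through its marginal CDF and then mapped into standard-normal space, where the correlations between columns are modeled. The sketch below shows that round trip for a `uniform` marginal using only the standard library. The bounds and helper names are assumptions for illustration; SDV fits the marginal parameters from data rather than taking them as arguments.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def to_normal_space(x: float, low: float, high: float, eps: float = 1e-6) -> float:
    """Map x from a uniform(low, high) marginal into standard-normal space."""
    u = (x - low) / (high - low)     # uniform CDF
    u = min(max(u, eps), 1.0 - eps)  # keep u strictly inside (0, 1)
    return STD_NORMAL.inv_cdf(u)


def from_normal_space(z: float, low: float, high: float) -> float:
    """Map a standard-normal value back to the uniform(low, high) marginal."""
    u = STD_NORMAL.cdf(z)
    return low + u * (high - low)


z = to_normal_space(35.0, low=20.0, high=60.0)
x_back = from_normal_space(z, low=20.0, high=60.0)
print(z, x_back)
```

Swapping the uniform CDF for another marginal (for example beta or gamma) changes only the first step; the normal-space mapping stays the same.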
Limitations
Built-in Integration Limits
- ✗ Cannot adjust training parameters (epochs, batch_size, etc.)
- ✗ Cannot specify distribution types
- ✗ Cannot manually select CPU/GPU
Important Notes
- Deep learning methods train faster on GPU
- The default is 300 epochs, so training on CPU may be time-consuming
- Deep learning on large datasets requires significant memory
- The built-in integration uses fixed parameters that cannot be adjusted