Custom Method: SDV

SDV (Synthetic Data Vault) is a mainstream and widely popular synthetic data generation package in the industry, offering various advanced algorithms. To help users understand how to integrate external packages in PETsARD, we have developed this demonstration showing how to flexibly use SDV’s various synthesis methods via the custom_method approach, with full control over all parameters including CPU/GPU selection and training parameters. This demonstration not only shows how to use SDV, but more importantly, provides a complete example of customizing external methods, allowing you to easily integrate other third-party packages or your own algorithms.

Example Usage

Click the button below to run the example in Colab:

Open In Colab

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
    nrows: 100 # load first 1K rows for faster testing
Preprocessor:
  default:
    method: default

Synthesizer:
  # Method 1: GaussianCopula
  exp1_gaussiancopula:
    method: custom_method
    module_path: sdv-custom-methods.py
    class_name: SDV_GaussianCopula
    # Optional parameters:
    # device: cpu                      # Compute device (default: cpu)
    # default_distribution: truncnorm  # Default distribution for numerical columns (default: truncnorm)
    # enforce_min_max_values: true     # Enforce min/max constraints (default: true)
    # enforce_rounding: true           # Round integer columns (default: true)
    # numerical_distributions:         # Column-specific distributions (default: {})
    #   age: beta
    #   hours-per-week: gamma

  # Method 2: CTGAN
  exp2_ctgan:
    method: custom_method
    module_path: sdv-custom-methods.py
    class_name: SDV_CTGAN
    epochs: 2                       # Training epochs (default: 300)
    batch_size: 50                  # Batch size (default: 500) ⚠️ IMPORTANT: Must be divisible by 'pac' (default pac=10)
    verbose: true                   # Show training progress (default: false)
    generator_dim:                  # Generator layer sizes (default: (256, 256))
      - 16
      - 16
    discriminator_dim:              # Discriminator layer sizes (default: (256, 256))
      - 16
      - 16
    # Optional parameters:
    # device: cpu                   # Compute device (default: cpu)
    # pac: 10                       # PAC (Packing) size (default: 10)
    # discriminator_steps: 1        # Discriminator updates per generator update (default: 1)
    # generator_lr: 0.0002          # Generator learning rate (default: 0.0002)
    # discriminator_lr: 0.0002      # Discriminator learning rate (default: 0.0002)
    # enforce_rounding: true        # Round integer columns (default: true)

  # Method 3: CopulaGAN
  exp3_copulagan:
    method: custom_method
    module_path: sdv-custom-methods.py
    class_name: SDV_CopulaGAN
    epochs: 2                       # Training epochs (default: 300)
    batch_size: 50                  # Batch size (default: 500)
                                    # ⚠️ IMPORTANT: Must be divisible by 'pac' (default pac=10)
                                    # Valid values: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, etc.
    # pac: 10                       # PAC (Packing) size (default: 10)
                                    # Controls how samples are packed together
                                    # batch_size must be divisible by this value
    verbose: true                   # Show training progress (default: false)
    generator_dim:                  # Generator layer sizes (default: (256, 256))
      - 16
      - 16
    discriminator_dim:              # Discriminator layer sizes (default: (256, 256))
      - 16
      - 16
    # Optional parameters:
    # device: cpu                   # Compute device (default: cpu)
    # default_distribution: beta    # Default distribution for numerical columns (default: beta)
    # discriminator_steps: 1        # Discriminator updates per generator update (default: 1)
    # generator_lr: 0.0002          # Generator learning rate (default: 0.0002)
    # discriminator_lr: 0.0002      # Discriminator learning rate (default: 0.0002)
    # enforce_rounding: true        # Round integer columns (default: true)
    # numerical_distributions:      # Column-specific distributions (default: {})
    #   age: beta

  # Method 4: TVAE
  exp4_tvae:
    method: custom_method
    module_path: sdv-custom-methods.py
    class_name: SDV_TVAE
    epochs: 2                       # Training epochs (default: 300)
    batch_size: 50                  # Batch size (default: 500) ⚠️ IMPORTANT: Must be divisible by 'pac' (default pac=10)
                                    # Note: TVAE doesn't use 'pac', but keeping batch_size=50 for consistency
    verbose: true                   # Show training progress (default: false)
    encoder_layers:                 # Encoder layer sizes (default: (128, 128))
      - 16
      - 16
    decoder_layers:                 # Decoder layer sizes (default: (128, 128))
      - 16
      - 16
    embedding_dim: 32               # Embedding dimension (default: 128)
    # Optional parameters:
    # device: cpu                   # Compute device (default: cpu)
    # encoder_layers: [128, 128]    # Encoder layer sizes (default: (128, 128))
    # decoder_layers: [128, 128]    # Decoder layer sizes (default: (128, 128))
    # enforce_min_max_values: true  # Enforce min/max constraints (default: true)
    # enforce_rounding: true        # Round integer columns (default: true)
    # verbose: false                # Show training progress (default: false)
    # l2scale: 0.00001              # L2 regularization coefficient (default: 1e-5)
    # loss_factor: 2                # Reconstruction loss factor (default: 2)

Postprocessor:
  default:
    method: default
Evaluator:
  eval_all_methods:
    method: sdmetrics-qualityreport
Reporter:
  save_comparison:
    method: save_report
    granularity: global

Prerequisites

1. Install SDV

pip install sdv

2. Download Script

Download sdv-custom-methods.py to the same directory as your YAML file.

3. Check CUDA Availability (Optional)

python check_cuda.py

Supported Methods

1. SDV_GaussianCopula

Fast statistical method suitable for large datasets and rapid prototyping.

Features:

  • No training iterations required, fastest method
  • CPU-only (does not use GPU)
  • Customizable distribution types for numerical columns

Parameters:

ParameterTypeDefaultDescription
devicestringcpuCompute device (only cpu supported)
default_distributionstringtruncnormDefault distribution for numerical columns
numerical_distributionsdict{}Specify distributions for specific columns
enforce_min_max_valuesbooltrueEnforce numerical range constraints
enforce_roundingbooltrueRound integer columns

Available Distribution Types:

  • truncnorm: Truncated normal distribution
  • beta: Beta distribution
  • gamma: Gamma distribution
  • uniform: Uniform distribution
  • norm: Normal distribution

Example:

exp_gaussiancopula:
  method: custom_method
  module_path: sdv-custom-methods.py
  class_name: SDV_GaussianCopula
  device: cpu
  default_distribution: truncnorm
  numerical_distributions:
    age: beta
    hours-per-week: gamma

2. SDV_CTGAN

GAN-based deep learning method providing highest quality synthetic data.

Features:

  • High-quality synthetic data
  • GPU acceleration support
  • Full control over training parameters

Parameters:

ParameterTypeDefaultDescription
devicestringcpuCompute device (cpu or cuda)
epochsint300Training epochs
batch_sizeint500Batch size
discriminator_stepsint1Discriminator updates per generator update
generator_lrfloat0.0002Generator learning rate
discriminator_lrfloat0.0002Discriminator learning rate
generator_dimlist[256, 256]Generator hidden layer sizes
discriminator_dimlist[256, 256]Discriminator hidden layer sizes
enforce_roundingbooltrueRound integer columns
verboseboolfalseShow training progress

Example:

exp_ctgan:
  method: custom_method
  module_path: sdv-custom-methods.py
  class_name: SDV_CTGAN
  device: cuda        # Use GPU acceleration
  epochs: 100         # Fewer epochs for quick testing
  batch_size: 500
  verbose: true       # Show training progress

3. SDV_CopulaGAN

Hybrid model combining statistical methods and GANs.

Features:

  • Balances statistical accuracy and generation quality
  • Better marginal distribution fitting
  • GPU acceleration support

Parameters:

ParameterTypeDefaultDescription
devicestringcpuCompute device (cpu or cuda)
epochsint300Training epochs
batch_sizeint500Batch size
default_distributionstringbetaDefault distribution for numerical columns
numerical_distributionsdict{}Specify distributions for specific columns
discriminator_stepsint1Discriminator updates per generator update
generator_lrfloat0.0002Generator learning rate
discriminator_lrfloat0.0002Discriminator learning rate
enforce_roundingbooltrueRound integer columns
verboseboolfalseShow training progress

Example:

exp_copulagan:
  method: custom_method
  module_path: sdv-custom-methods.py
  class_name: SDV_CopulaGAN
  device: cpu
  epochs: 200
  default_distribution: beta

4. SDV_TVAE

VAE-based generative model with stable training process.

Features:

  • Stable training process
  • Better convergence
  • GPU acceleration support

Parameters:

ParameterTypeDefaultDescription
devicestringcpuCompute device (cpu or cuda)
epochsint300Training epochs
batch_sizeint500Batch size
encoder_layerslist[128, 128]Encoder hidden layer sizes
decoder_layerslist[128, 128]Decoder hidden layer sizes
embedding_dimint128Embedding dimension size
l2scalefloat0.00001L2 regularization coefficient
loss_factorint2Reconstruction loss factor
enforce_min_max_valuesbooltrueEnforce numerical range constraints
enforce_roundingbooltrueRound integer columns
verboseboolfalseShow training progress

Example:

exp_tvae:
  method: custom_method
  module_path: sdv-custom-methods.py
  class_name: SDV_TVAE
  device: cpu
  epochs: 300
  batch_size: 1000
  encoder_layers: [256, 256]
  decoder_layers: [256, 256]

Device Parameter Explanation

CPU Mode (Default)

device: cpu  # Use CPU for training

Characteristics:

  • No GPU hardware required
  • Suitable for small datasets
  • Slower training speed

CUDA Mode (Requires GPU)

device: cuda  # Use GPU for training

Characteristics:

  • Requires NVIDIA GPU
  • Significantly faster training (2-10x speedup)
  • Suitable for large datasets and long training

CUDA Validation Mechanism:

  • If cuda is specified but not available, program will immediately terminate with error message
  • Use check_cuda.py to check CUDA availability beforehand

Method Support

MethodCPUCUDA
SDV_GaussianCopula✗ (will warn and use CPU)
SDV_CTGAN
SDV_CopulaGAN
SDV_TVAE

Parameter Tuning Recommendations

Quick Testing Configuration

epochs: 10-50        # Quick results
batch_size: 500      # Standard batch size
verbose: true        # Observe training progress

Production Configuration

epochs: 300-500      # Sufficient training
batch_size: 500-1000 # Adjust based on memory
verbose: false       # Reduce output
device: cuda         # Use GPU if available

Memory Optimization

batch_size: 250-500  # Reduce batch size
epochs: 300          # Maintain sufficient training

FAQ

Q: How do I know if my environment supports CUDA?

Run the check script:

python check_cuda.py

Q: What if CTGAN/CopulaGAN/TVAE training is too slow?

  1. Reduce epochs: Lower from 300 to 50-100 for testing
  2. Use GPU: Set device: cuda (if available)
  3. Increase batch_size: Can speed up each epoch

Q: Can GaussianCopula use GPU?

No, GaussianCopula is a pure statistical method without neural networks, so it doesn’t support GPU. If device: cuda is set, the system will warn and automatically use CPU.

Q: How to choose the right method?

NeedRecommended Method
Quick testingSDV_GaussianCopula
Highest qualitySDV_CTGAN (requires GPU)
Balanced quality/speedSDV_CopulaGAN
Training stabilitySDV_TVAE
No GPU environmentSDV_GaussianCopula or reduce epochs

File Sources

Related Documentation