PETsARD Gaussian Copula

An efficient Gaussian Copula synthesizer implemented with Numba JIT and PyTorch, supporting CPU/GPU hybrid computing and intelligent device selection.

Usage Examples

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Preprocessor:
  default:
    method: default

Synthesizer:

  petsard-gaussian-copula:
    method: petsard-gaussian-copula
    sample_num_rows: 1000  # Number of rows to generate, default: training data row count
    use_gpu: auto          # Device selection, default: auto (automatic)
    gpu_threshold: 50000   # Threshold for auto mode, default: 50,000

  sdv_gaussiancopula:
    method: custom_method
    module_path: sdv-custom-methods.py
    class_name: SDV_GaussianCopula

Postprocessor:
  default:
    method: default
Evaluator:
  eval_all_methods:
    method: sdmetrics-qualityreport
Reporter:
  save_comparison:
    method: save_report
    granularity: global

Parameters

  • method (string, required) - Fixed value: petsard-gaussian-copula
  • sample_num_rows (integer, optional) - Number of synthetic data rows to generate; defaults to the training data row count
  • use_gpu (string or boolean, optional, default "auto") - Device selection mode:
    • "auto" (default): Automatically selects device based on data size. Uses GPU when data exceeds gpu_threshold and GPU is available, otherwise uses CPU
    • true: Force GPU usage. Raises error if GPU is unavailable
    • false: Force CPU usage
  • gpu_threshold (integer, optional, default 50,000) - When use_gpu="auto", the GPU is used if the data exceeds this row count (see the selection sketch after this list)
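
The auto mode reduces to a simple rule. The Python sketch below illustrates it; select_device is a hypothetical helper written for illustration, not part of the PETsARD API:

import torch

def select_device(n_rows: int, use_gpu="auto", gpu_threshold: int = 50_000) -> str:
    # Hypothetical helper mirroring the documented selection rule.
    if use_gpu == "auto":
        # GPU only when the data is large enough and a GPU is present
        if n_rows > gpu_threshold and torch.cuda.is_available():
            return "cuda"
        return "cpu"
    if use_gpu is True:
        if not torch.cuda.is_available():
            raise RuntimeError("use_gpu=True but no GPU is available")
        return "cuda"
    return "cpu"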

Algorithm Principles

Gaussian Copula separates marginal distributions from correlation structure through the following steps:

  1. Marginal Transformation - Use empirical CDF to transform to uniform distribution: $u_i = F_i(X_i) = \frac{\text{rank}(X_i)}{n}$
  2. Gaussianization - Use standard normal quantile function: $z_i = \Phi^{-1}(u_i)$
  3. Correlation Learning - Learn correlation matrix in Gaussian space: $\Sigma = \text{corr}(\mathbf{Z})$
  4. Joint Sampling - Sample from multivariate normal distribution: $\mathbf{Z}^* \sim \mathcal{N}(\mathbf{0}, \Sigma)$
  5. Inverse Transformation - Transform back to original space: $u_i^* = \Phi(z_i^*), \quad X_i^* = F_i^{-1}(u_i^*)$

Mathematical representation: $H(\mathbf{X}) = \Phi_{\Sigma}\left(\Phi^{-1}(F_1(X_1)), \ldots, \Phi^{-1}(F_D(X_D))\right)$
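
The five steps condense into a short NumPy/SciPy sketch. This is an illustrative reimplementation of the standard algorithm, not PETsARD's actual code (which adds Numba JIT kernels and GPU paths); it uses rank/(n+1) rather than rank/n as a common numerical guard that keeps $u$ strictly inside $(0, 1)$:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(1000, 3))              # toy training data, D = 3

# 1. Marginal transformation: empirical CDF via ranks
U = stats.rankdata(X, axis=0) / (X.shape[0] + 1)     # /(n+1) keeps u in (0, 1)

# 2. Gaussianization: z = Phi^{-1}(u)
Z = stats.norm.ppf(U)

# 3. Correlation learning in Gaussian space
Sigma = np.corrcoef(Z, rowvar=False)

# 4. Joint sampling from N(0, Sigma)
Z_new = rng.multivariate_normal(np.zeros(X.shape[1]), Sigma, size=500)

# 5. Inverse transformation: u* = Phi(z*), then x* = F^{-1}(u*) via empirical quantiles
U_new = stats.norm.cdf(Z_new)
X_new = np.column_stack(
    [np.quantile(X[:, j], U_new[:, j]) for j in range(X.shape[1])]
)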

Implementation Features

Hybrid Computing Architecture

PETsARD adopts a NumPy + Numba JIT + PyTorch architecture, using the most suitable tool for each stage:

Stage             | Tool                 | Speedup
------------------|----------------------|-----------------------------------
Transform         | NumPy + Numba JIT    | ~700x after JIT compilation
Correlation       | NumPy                | Fast and stable
Regularization    | NumPy (Ledoit-Wolf)  | Avoids eigenvalue decomposition
Sampling          | NumPy                | ~100x faster than PyTorch CPU
Inverse Transform | NumPy + Numba JIT    | JIT-compiled linear interpolation
GPU Operations    | PyTorch              | Large dataset acceleration

Core Optimization Techniques

  • Numba JIT Compilation - Custom rank-calculation and linear-interpolation kernels, 2-3x faster than standard implementations and 10-100x faster once compiled (see the sketch after this list)
  • Intelligent Device Selection - Small data (< 50K rows) uses the CPU to avoid transfer overhead, while large data uses GPU acceleration
  • Identity Fast Path - Uses ultra-fast independent sampling when variables are detected as independent
  • Ledoit-Wolf Regularization - Uses $\Sigma_{\text{reg}} = (1 - \lambda)\Sigma + \lambda I$, performing eigenvalue decomposition only when necessary
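
To give a flavor of the Numba optimization named above, here is a minimal JIT-compiled linear-interpolation kernel for the empirical quantile function $F^{-1}(u)$; the function name and exact logic are illustrative assumptions, not PETsARD's internal kernel:

import numpy as np
from numba import njit

@njit(cache=True)
def interp_quantile(sorted_values, u):
    # Linearly interpolate the empirical quantile function F^{-1}(u)
    # over one pre-sorted column of training values.
    n = sorted_values.shape[0]
    out = np.empty(u.shape[0])
    for k in range(u.shape[0]):
        pos = u[k] * (n - 1)                 # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out[k] = (1.0 - frac) * sorted_values[lo] + frac * sorted_values[hi]
    return out

The first call triggers compilation; subsequent calls run the cached machine code, which is where the reported speedups come from.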

Differences from Other Implementations

Same: the standard Gaussian Copula algorithm, using the same statistical methods as other implementations.

Different: PETsARD adds engineering optimizations such as Numba JIT acceleration, PyTorch GPU support, and intelligent device selection.

Data Requirements

  • ✅ Categorical variables already encoded as integers (0, 1, 2, …)
  • ✅ All columns are numeric types (int, float, datetime)
  • ❌ string/object types are not accepted

The generation process works in float64 internally, then automatically restores the original types (rounding integers back, converting datetime columns, etc.), as sketched below.
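
As an illustration of that restore step (an assumption based on the description above, not PETsARD's exact code):

import pandas as pd

# float64 output from the sampler (toy values)
synth = pd.DataFrame({
    "age": [35.4, 41.8],
    "signup_ts": [1.5778368e18, 1.5800000e18],   # nanoseconds since epoch
})

synth["age"] = synth["age"].round().astype("int64")                   # integer column restored
synth["signup_ts"] = pd.to_datetime(synth["signup_ts"], unit="ns")    # datetime64 restored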

Performance and Limitations

Performance Reference

  • Small data (< 10K rows): Training ~1s, Generation ~0.5s
  • Medium data (10K-50K rows): Training ~2-3s, Generation ~1s
  • Large data (> 50K rows): Automatically switches to GPU for a 2-5x speedup
ℹ️ The first run includes Numba JIT compilation (~2s); subsequent runs are much faster.

Limitations

  • Primarily captures linear correlations (Pearson correlation coefficient); non-linear relationships may not be fully reproduced (see the example after this list)
  • Complex conditional dependencies are simplified to joint Gaussian distribution
  • The correlation matrix is $O(D^2)$ in the number of columns, so datasets with more than 1,000 columns require significant memory
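
A worked example of the first limitation: for a symmetric $x$, $y = x^2$ is completely determined by $x$, yet its Pearson correlation with $x$ is essentially zero, so the fitted correlation matrix treats the two columns as (nearly) independent:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x ** 2                          # perfect non-linear dependence on x

# Pearson correlation is ~0 for a symmetric x, so this dependence is lost
print(np.corrcoef(x, y)[0, 1])      # prints a value near 0.0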
