PETsARD YAML
Why Does PETsARD Use YAML?
PETsARD adopts YAML as its primary configuration method, allowing you to accomplish most tasks without writing Python code.
- No programming required: Execute complete synthesis and evaluation workflows just by writing configuration files
- Easy version control: Plain text format, convenient for tracking changes and team collaboration
- Batch processing: One configuration file can define multiple experiments and operations
- Reusable: Configuration files can be easily shared and reused
PETsARD YAML Basic Structure
PETsARD’s YAML configuration adopts a three-layer architecture:
Module_Name: # Layer 1: Module
Experiment_Name: # Layer 2: Experiment
parameter1: value # Layer 3: Parameters
parameter2: value
Module Level
The top level defines processing modules arranged in execution order:
Strongly Recommended Execution Order
We strongly recommend configuring modules in the following order. We are not responsible for execution results caused by changing the order.
- Executor: Execution settings (logging, working directory, etc.)
- Loader: Data loading
- Splitter: Data splitting
- Preprocessor: Data preprocessing
- Synthesizer: Data synthesis
- Postprocessor: Data postprocessing
- Constrainer: Data constraints
- Describer: Data description
- Evaluator: Result evaluation
- Reporter: Report generation
Module Execution Limitation
Currently, each module can only be executed once.
Experiment Level
Each module can have multiple experiment configurations. Experiment names are custom and can be named according to purpose:
Synthesizer:
gaussian-copula: # Using Gaussian Copula method
method: 'sdv-single_table-gaussiancopula'
ctgan: # Using CTGAN method
method: 'sdv-single_table-ctgan'
tvae: # Using TVAE method
method: 'sdv-single_table-tvae'
Multiple experiments within the same module execute sequentially, allowing you to:
- Compare effects of different methods
- Test different parameter settings
- Perform batch processing
Parameter Level
Each experiment contains specific parameter settings. Different methods have different parameter requirements.
Complete Example
# A complete PETsARD configuration example
Loader:
data:
filepath: 'benchmark/adult-income.csv'
Preprocessor:
demo:
method: 'default'
Synthesizer:
gaussian-copula:
method: 'sdv-single_table-gaussiancopula'
ctgan:
method: 'sdv-single_table-ctgan'
tvae:
method: 'sdv-single_table-tvae'
Postprocessor:
demo:
method: 'default'
Evaluator:
quality-report:
method: 'sdmetrics-qualityreport'
Reporter:
save-data:
method: 'save_data'
source: 'Synthesizer'
This example demonstrates:
- Loading data (Loader)
- Data preprocessing (Preprocessor)
- Synthesizing data using three different methods (Synthesizer)
- Data postprocessing (Postprocessor)
- Evaluating synthetic data quality (Evaluator)
- Saving results (Reporter)
Execution Flow
When multiple experiments are defined, PETsARD executes all module combinations in a depth-first manner:
Loader → Splitter → Preprocessor → Synthesizer → Postprocessor → Constrainer → Describer → Evaluator → Reporter
Depth-First Execution Order
Depth-first means PETsARD will complete all modules for the first experiment combination before starting the second one. It’s like walking one path to the end before returning to walk another path.
Example:
Combination 1: Loader(A) → Synthesizer(method_a) → Evaluator → Reporter
Combination 2: Loader(A) → Synthesizer(method_b) → Evaluator → Reporter
Combination 3: Loader(B) → Synthesizer(method_a) → Evaluator → Reporter
Combination 4: Loader(B) → Synthesizer(method_b) → Evaluator → Reporter
Execution Tree:
graph TD Start[Start] --> A[A] Start --> B[B] A --> A1["① A+method_a"] A --> A2["② A+method_b"] B --> B1["③ B+method_a"] B --> B2["④ B+method_b"] style A1 fill:#e1f5e1 style A2 fill:#e1f5e1 style B1 fill:#e1f5e1 style B2 fill:#e1f5e1
Legend:
- ① ② ③ ④: Execution order numbers
- Green boxes: Complete experiment combinations (will execute all subsequent modules)
- Arrows: Data flow direction
Execution order: Complete all modules for combination ① → then combination ② → and so on
Not breadth-first: executing Loader for all combinations first, then Preprocessor for all combinations…
Experiment Combinations
If multiple experiments are defined in different modules, PETsARD generates all possible combinations. For example:
Loader:
load_a:
filepath: 'data1.csv'
load_b:
filepath: 'data2.csv'
Synthesizer:
method_a:
method: 'method-a'
method_b:
method: 'method-b'
This generates four experiment combinations:
- load_a + method_a
- load_a + method_b
- load_b + method_a
- load_b + method_b
Each combination executes the complete workflow once, allowing you to systematically compare the effects of different configurations.
Output Results
Output results require the use of the Reporter module. Please refer to the Reporter module documentation for detailed configuration methods.
Best Practices
Following these recommendations will make your YAML configurations more readable and maintainable:
Use meaningful experiment names
- Good:
ctgan_epochs100
,preprocessing_with_scaling
- Avoid:
exp1
,test
,a
- Good:
Organize parameters by module
- Keep related parameters together
- Maintain consistent indentation (usually 2 or 4 spaces)
Add comments to experiment configurations
Synthesizer: ctgan: # Using CTGAN for tabular data synthesis method: 'sdv-single_table-ctgan' epochs: 300 # Increase training epochs to improve quality
Validate YAML syntax before execution
- Ensure correct indentation (YAML is indentation-sensitive)
- Check for spaces after colons
- Verify quote pairing
Leverage multi-experiment comparisons
- Define multiple experiments in the same configuration file for comparison
- Use consistent naming conventions for easy identification
Keep configuration files concise
- Only set parameters that need to be changed
- Use default values for other parameters