Save Schema
Use the save_schema method to export schema information from specified source modules to CSV (default) or YAML files.
Usage Examples
Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema
Splitter:
  basic_split:
    num_samples: 1
    train_split_ratio: 0.8
Preprocessor:
  default:
    method: 'default'
Synthesizer:
  default:
    method: 'default'
Postprocessor:
  default:
    method: 'default'
Reporter:
  save_schema:
    method: save_schema      # Required: Fixed to save_schema method
    source:                  # Required: Specify modules to extract schema from
      - Loader
      - Preprocessor
      - Synthesizer
      - Postprocessor
    yaml_output: true        # Optional: Output individual YAML files (default: false)
    # output: petsard        # Optional: Output file name prefix (default: petsard)
    # properties:            # Optional: Specify properties to output (default: all properties)
    #   - dtype
    #   - nullable
    #   - min
    #   - max
Main Parameters
Required Parameters
Parameter | Type | Description | Example |
---|---|---|---|
method | string | Fixed as save_schema (case-insensitive) | save_schema or SAVE_SCHEMA |
source | string or list | Target module name(s) | Loader or ["Loader", "Preprocessor"] |
source Parameter Details
The source parameter specifies which modules’ schemas to export. Supported formats:
Single Module: Export schema from one module
source: Preprocessor
Multiple Modules: Export schemas from multiple modules
source:
  - Loader
  - Preprocessor
  - Synthesizer
Supported Modules:
- Loader: Original data schema
- Splitter: Split data schema
- Preprocessor: Preprocessed data schema
- Synthesizer: Synthetic data schema
- Postprocessor: Postprocessed data schema
- Constrainer: Constrained data schema
Reference Notes:
- Execution Order: Can only reference modules executed before the current Reporter
- Module Names: Must exactly match module names defined in the YAML configuration
- Data Availability: Schema extraction requires the module to have successfully executed
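The reference rules above can be illustrated with a small check. This is a minimal sketch, not PETsARD's own validation logic: `normalize_source`, `check_sources`, and the `executed_modules` list are hypothetical names introduced here only to show how a string-or-list `source` and exact name matching might be handled.

```python
def normalize_source(source):
    """Accept a single module name or a list of names; always return a list."""
    if isinstance(source, str):
        return [source]
    return list(source)


def check_sources(source, executed_modules):
    """Return the requested module names that are not available.

    `executed_modules` is assumed to hold the names of modules that ran
    successfully before the Reporter (an assumption of this sketch).
    """
    requested = normalize_source(source)
    return [name for name in requested if name not in executed_modules]


executed = ["Loader", "Splitter", "Preprocessor", "Synthesizer"]
print(check_sources("Preprocessor", executed))               # []
print(check_sources(["Loader", "Postprocessor"], executed))  # ['Postprocessor']
print(check_sources(["loader"], executed))                   # ['loader'] (names must match exactly)
```

An empty result means every requested module is available; any name returned would need to be corrected or moved earlier in the pipeline.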
Optional Parameters
Parameter | Type | Default | Description | Example |
---|---|---|---|---|
output | string | petsard | Output file name prefix | my_experiment |
yaml_output | boolean | false | Whether to output YAML format additionally | true , false |
properties | string or list | All properties | Specify property names to output | dtype or ["dtype", "nullable", "min", "max"] |
properties Parameter Details
The properties parameter filters which attributes to output. Only specified properties will appear in output files. Supported formats:
Single Property: Output only the specified property
properties: dtype
Multiple Properties: Output multiple specified properties
properties:
  - dtype
  - nullable
  - min
  - max
Common Properties:
- dtype: Data type
- nullable: Whether null values are allowed
- min, max, mean, std: Statistics for numeric columns
- categories: Category values for categorical columns
- unique_count: Number of unique values
Usage Example:
Reporter:
  save_schema:
    method: save_schema
    source:
      - Loader
      - Synthesizer
    properties:
      - dtype
      - nullable
    output: filtered_schema
Effect:
- CSV file will only include {column_name}_dtype and {column_name}_nullable columns
- Other attributes (like min, max, categories, etc.) will not be output
- Applies to all columns regardless of their data type
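The filtering effect can be mimicked on a list of flattened column names. The helper below is a hypothetical illustration, not the library's implementation; it assumes attribute columns follow the `{column_name}_{property}` pattern and uses simple suffix matching, which would be ambiguous for a data column whose own name ends in a property name.

```python
def filter_schema_columns(columns, properties):
    """Keep only columns named `{column_name}_{property}` for the given properties."""
    if isinstance(properties, str):
        properties = [properties]
    # str.endswith accepts a tuple of candidate suffixes
    suffixes = tuple(f"_{prop}" for prop in properties)
    return [col for col in columns if col.endswith(suffixes)]


all_columns = [
    "age_dtype", "age_nullable", "age_min", "age_max",
    "workclass_dtype", "workclass_nullable", "workclass_categories",
]
kept = filter_schema_columns(all_columns, ["dtype", "nullable"])
print(kept)
# ['age_dtype', 'age_nullable', 'workclass_dtype', 'workclass_nullable']
```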
Output Format
Default CSV Format (Summary)
By default, schema information is written to a single CSV file with one row per source (experiment) and every column attribute expanded into its own column:
Filename format:
{output}_schema_{source1-source2-...}_summary.csv
The filename includes all source module names joined by hyphens, similar to the save_data method.
Examples:
- With output: "petsard" and source: ["Loader", "Preprocessor", "Synthesizer"]: petsard_schema_Loader-Preprocessor-Synthesizer_summary.csv
- With output: "petsard" and source: "Loader": petsard_schema_Loader_summary.csv
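To locate the summary file from a script, the naming pattern above can be reproduced in a few lines. `summary_filename` is a hypothetical helper written for this illustration; it only mirrors the documented pattern and is not part of PETsARD.

```python
def summary_filename(output="petsard", source="Loader"):
    """Build `{output}_schema_{source1-source2-...}_summary.csv`."""
    sources = [source] if isinstance(source, str) else list(source)
    return f"{output}_schema_{'-'.join(sources)}_summary.csv"


print(summary_filename("petsard", ["Loader", "Preprocessor", "Synthesizer"]))
# petsard_schema_Loader-Preprocessor-Synthesizer_summary.csv
print(summary_filename("petsard", "Loader"))
# petsard_schema_Loader_summary.csv
```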
CSV Structure:
- First column: source (source experiment name)
- Remaining columns: {column_name}_{attribute_name}, for example:
  - age_dtype: Data type of age column
  - age_nullable: Whether age column allows null values
  - age_min, age_max, age_mean: Statistics for age column
  - workclass_categories: Category values for workclass column
When using the properties parameter:
- Only specified properties will be output
- For example, with properties: ["dtype", "nullable"], the CSV will only include age_dtype, age_nullable, workclass_dtype, workclass_nullable, etc.
- Unspecified attributes (like min, max, categories) will not appear in the output
Advantages:
- Easy to compare schema differences across experiments
- Can be opened directly with Excel or other tools for analysis
- Suitable for version control and diff comparison
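As a rough illustration of cross-source comparison, the sketch below parses a summary in the documented layout and reports the attributes that differ between two sources. The CSV content is made up for this example (the values are not real PETsARD output), and `load_summary` and `diff_sources` are hypothetical helpers. Empty fields are treated as missing.

```python
import csv
import io

# Illustrative content only: made-up values in the documented layout
sample_csv = """source,age_dtype,age_min,workclass_dtype
Loader,int64,17,category
Synthesizer,float64,,category
"""


def load_summary(text):
    """Map each source to its attributes; empty fields become None."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        source = row.pop("source")
        rows[source] = {key: (value or None) for key, value in row.items()}
    return rows


def diff_sources(rows, left, right):
    """Return {attribute: (left_value, right_value)} where the two sources differ."""
    return {
        key: (rows[left][key], rows[right][key])
        for key in rows[left]
        if rows[left][key] != rows[right][key]
    }


rows = load_summary(sample_csv)
print(diff_sources(rows, "Loader", "Synthesizer"))
# {'age_dtype': ('int64', 'float64'), 'age_min': ('17', None)}
```

For a real summary file, the same reader can be fed from `open(path, newline="", encoding="utf-8")` instead of `io.StringIO`.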
Optional YAML Format
When yaml_output: true is set, an individual YAML file is also written for each experiment:
{output}_schema_{full_experiment_name}.yaml
Example output filenames:
petsard_schema_Loader[load_benchmark_with_schema].yaml
petsard_schema_Loader[load_benchmark_with_schema]_Preprocessor[scaler].yaml
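Judging from the example filenames, the full experiment name appears to chain `Module[experiment]` segments with underscores. The sketch below reproduces that pattern under this assumption; `yaml_schema_filename` is a hypothetical helper and the chaining rule is inferred from the examples, not confirmed by the library.

```python
def yaml_schema_filename(output, experiment_steps):
    """Build `{output}_schema_{full_experiment_name}.yaml`.

    `experiment_steps` is a list of (module_name, experiment_name) pairs in
    pipeline order; joining them with underscores is an assumption inferred
    from the example filenames.
    """
    full_name = "_".join(f"{module}[{name}]" for module, name in experiment_steps)
    return f"{output}_schema_{full_name}.yaml"


print(yaml_schema_filename("petsard", [("Loader", "load_benchmark_with_schema")]))
# petsard_schema_Loader[load_benchmark_with_schema].yaml
print(yaml_schema_filename(
    "petsard",
    [("Loader", "load_benchmark_with_schema"), ("Preprocessor", "scaler")],
))
# petsard_schema_Loader[load_benchmark_with_schema]_Preprocessor[scaler].yaml
```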
Usage:
Reporter:
  save_schema:
    method: save_schema
    source:
      - Loader
      - Preprocessor
    yaml_output: true  # Output YAML files additionally
Use Cases
- Data Transformation Tracking: Track how data structure changes through the processing pipeline
- Quality Assurance: Verify that synthetic data maintains expected structure
- Documentation Generation: Generate comprehensive data documentation for your project
Common Questions
Q: What’s the difference between save_schema and save_data?
A:
- save_schema: Exports data structure information (column types, statistics) in CSV format by default (flattened table), optionally YAML
- save_data: Exports actual data content in CSV format
Q: Can I specify a custom save path?
A: Yes, use the output parameter with a relative path. Files will be saved relative to the current working directory.
Q: Why isn’t my module’s schema being saved?
A: Check:
- Module name is spelled correctly in source
- Module has been executed before Reporter
- Module has data available (not empty or failed)
Notes
- File Overwrite: Files with the same name will be overwritten
- Module Execution: Only modules that have successfully executed can have their schemas exported
- Encoding: All CSV and YAML files use UTF-8 encoding
- Case Insensitivity: The method parameter is case-insensitive (save_schema, SAVE_SCHEMA, or Save_Schema all work)
- Performance: Schema extraction is fast and doesn’t require loading full datasets
- Comparison: CSV format is especially suitable for comparing data structure changes across experiments
- Missing Values: If a column in one source doesn’t have a certain attribute (e.g., changed from numeric to categorical), that field will be left empty (NA)