Describer YAML

This page describes the YAML configuration format for the Describer module, which provides statistical description of a single dataset and statistical comparison between two datasets.

Usage Examples

Single Dataset Description (describe mode)

---
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Describer:
  describer-describe:
    method: default    # Auto-detects as describe (single source)
    source: Synthesizer
...

Dataset Comparison (compare mode)

---
Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark://adult-income_ori
      control: benchmark://adult-income_control
    schema:
      ori: benchmark://adult-income_schema
      control: benchmark://adult-income_schema
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Describer:
  describer-compare:
    method: default         # Auto-detects as compare (two sources)
    source:
      base: Splitter.train  # Use Splitter's train output as base
      target: Synthesizer   # Compare with Synthesizer's output
...

Custom Comparison Method

---
Loader:
  load_original:
    filepath: benchmark://adult-income_ori
    schema: benchmark://adult-income_schema
Synthesizer:
  generate_synthetic:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Describer:
  custom_comparison:
    method: compare           # Explicitly specify compare method
    source:
      base: Loader
      target: Synthesizer
    stats_method:             # Custom statistical methods
      - mean
      - std
      - nunique
      - jsdivergence
    compare_method: diff      # Use difference instead of percentage change
    aggregated_method: mean
    summary_method: mean
...

Main Parameters

  • method (string, optional)

    • Evaluation method
    • default: Automatically determine based on source count (1→describe, 2→compare)
    • describe: Single dataset statistical description
    • compare: Dataset comparison (integrating Stats functionality)
    • Default: default
  • source (string | dict, required)

    • Specify data source(s)
    • Single source: For describe method
    • Two sources: For compare method (must use dictionary format)
    • Available values: Loader, Splitter, Preprocessor, Synthesizer, Postprocessor, Constrainer
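The auto-detection rule for method can be sketched in Python. This is only an illustration of the rule stated above (1 source selects describe, 2 sources select compare), not the module's actual implementation; the function name resolve_method and its error messages are hypothetical.

```python
def resolve_method(method, source):
    """Return 'describe' or 'compare' for one Describer entry (illustrative only)."""
    if method not in ("default", "describe", "compare"):
        raise ValueError(f"unknown method: {method!r}")

    # A bare string names one source; a dict names one source per key.
    n_sources = 1 if isinstance(source, str) else len(source)

    if method == "default":
        # Auto-detect from the source count: 1 -> describe, 2 -> compare.
        if n_sources == 1:
            return "describe"
        if n_sources == 2:
            return "compare"
        raise ValueError(f"expected 1 or 2 sources, got {n_sources}")

    # An explicit method must agree with the number of sources given.
    if method == "describe" and n_sources != 1:
        raise ValueError("describe needs a single source")
    if method == "compare" and n_sources != 2:
        raise ValueError("compare needs two sources in dictionary format")
    return method


print(resolve_method("default", "Synthesizer"))
print(resolve_method("default", {"base": "Splitter.train", "target": "Synthesizer"}))
```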

Supported Methods

| Method | Description | Data Requirements | Output Content |
|---|---|---|---|
| default | Auto-detect mode | Based on source count | Based on detection result |
| describe | Single dataset statistics | One data source | global, columnwise, pairwise |
| compare | Dataset comparison analysis | Two data sources | global (with Score), columnwise |

Parameter Details

Common Parameters

| Parameter | Type | Required/Optional | Default | Description | Example |
|---|---|---|---|---|---|
| method | string | Optional | default | Evaluation method | describe, compare |
| source | string or dict | Required | None | Data source module(s) | See below |

Source Parameter Formats

1. Single source (describe method)

source: Loader

2. Dictionary format (compare method - required)

source:
  base: Splitter.train    # Explicitly specify base data
  target: Synthesizer      # Explicitly specify target for comparison

Note: For backward compatibility, the ori/syn key names are still supported, but base/target is recommended.
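As a rough sketch of how these source formats could be normalized, the Python below accepts either form, maps the legacy keys onto the recommended ones, and splits Module.key references. The alias mapping (ori to base, syn to target), the "data" role name for a single source, and the function names are all assumptions made for illustration; the module's real handling may differ.

```python
# Assumed alias mapping for the legacy key names (not confirmed by the docs).
LEGACY_KEYS = {"ori": "base", "syn": "target"}


def split_reference(ref):
    """Split 'Splitter.train' into ('Splitter', 'train'); 'Loader' into ('Loader', None)."""
    module, _, key = ref.partition(".")
    return module, (key or None)


def normalize_source(source):
    """Return a dict mapping each role to a (module, key) pair (illustrative only)."""
    if isinstance(source, str):
        # Single source, as used by describe; "data" is a hypothetical role name.
        return {"data": split_reference(source)}

    normalized = {}
    for key, ref in source.items():
        role = LEGACY_KEYS.get(key, key)
        if role in normalized:
            raise ValueError(f"duplicate role: {role!r}")
        normalized[role] = split_reference(ref)

    # Dictionary format is for compare, which needs exactly base and target.
    if set(normalized) != {"base", "target"}:
        raise ValueError("dictionary source must define base and target")
    return normalized


print(normalize_source({"base": "Splitter.train", "target": "Synthesizer"}))
```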

Compare Method Specific Parameters

| Parameter | Type | Default | Description | Available Values |
|---|---|---|---|---|
| stats_method | list | All methods | Statistical methods list | mean, std, median, min, max, nunique, jsdivergence |
| compare_method | string | pct_change | Comparison method | pct_change, diff |
| aggregated_method | string | mean | Aggregation method | mean |
| summary_method | string | mean | Summary method | mean |

Statistical Methods Explanation

| Method | Applicable Data Type | Description | Execution Level |
|---|---|---|---|
| mean | Numeric | Mean value | columnwise |
| std | Numeric | Standard deviation | columnwise |
| median | Numeric | Median value | columnwise |
| min | Numeric | Minimum value | columnwise |
| max | Numeric | Maximum value | columnwise |
| nunique | Categorical | Number of unique values | columnwise |
| jsdivergence | Categorical | JS divergence | percolumn |
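For readers unfamiliar with JS (Jensen-Shannon) divergence, the following is a minimal pure-Python sketch of the quantity for one categorical column observed in two datasets. It uses the textbook definition with base-2 logarithms, so the result lies between 0 and 1; the log base and the function name js_divergence are assumptions, and the module's own computation may differ in detail.

```python
import math
from collections import Counter


def js_divergence(base_values, target_values):
    """Jensen-Shannon divergence (log base 2) between two categorical samples."""
    if not base_values or not target_values:
        raise ValueError("both samples must be non-empty")

    base_counts = Counter(base_values)
    target_counts = Counter(target_values)
    categories = set(base_counts) | set(target_counts)

    total = 0.0
    for category in categories:
        p = base_counts[category] / len(base_values)      # frequency in base
        q = target_counts[category] / len(target_values)  # frequency in target
        m = 0.5 * (p + q)                                 # mixture frequency
        # Each term is half of a KL contribution; zero frequencies add nothing.
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


# Identical category frequencies give 0; disjoint categories give 1.
print(js_divergence(["a", "b"], ["b", "a"]))
print(js_divergence(["a"], ["b"]))
```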

Comparison Methods Explanation

| Method | Formula | Use Case |
|---|---|---|
| pct_change | (target - base) / abs(base) | View relative change magnitude |
| diff | target - base | View absolute change amount |
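The two formulas can be checked with a short Python sketch applied to a single pair of statistics (for example, the mean of one column in each dataset). The function name compare_values is hypothetical, and returning NaN when base is zero is an assumption of this sketch, since the formula is undefined there and the documentation does not say how that case is handled.

```python
import math


def compare_values(base, target, compare_method="pct_change"):
    """Compare one statistic between base and target (illustrative only)."""
    if compare_method == "diff":
        return target - base
    if compare_method == "pct_change":
        if base == 0:
            return math.nan  # assumption: undefined ratio reported as NaN
        # abs(base) keeps the sign of the result equal to the direction of change.
        return (target - base) / abs(base)
    raise ValueError(f"unknown compare_method: {compare_method!r}")


print(compare_values(40, 50, "diff"))          # 10
print(compare_values(40, 50, "pct_change"))    # 0.25
print(compare_values(-40, -50, "pct_change"))  # -0.25
```

Dividing by abs(base) rather than base means a decrease from a negative base still reports a negative change, as the last call shows.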

Execution Notes

  • The source parameter is required; the data source(s) must be specified explicitly
  • The method parameter can be omitted; it defaults to default (auto-detection)
  • Statistical methods are filtered automatically based on data types

Important Notes

  • source is a required parameter: the data source(s) to analyze must be specified explicitly
  • compare mode requires dictionary format: the base and target keys must be specified explicitly
  • Backward compatibility: the ori/syn key names are still supported, but base/target is recommended
  • The compare method integrates the functionality of the original Stats evaluator
  • Statistical methods that do not apply to a column's data type return NaN
  • Recommended for numeric data: mean, std, median, min, max
  • Recommended for categorical data: nunique, jsdivergence
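The NaN behavior for inapplicable methods can be illustrated with a small Python sketch covering only the single-column statistics. The applicability split follows the Statistical Methods Explanation; the function name column_stats, the is_numeric flag, and the use of the sample standard deviation for std are assumptions of this sketch rather than documented behavior. jsdivergence is left out because it needs a column from each of two datasets.

```python
import math
import statistics

# Single-column statistics grouped by the data type they apply to.
NUMERIC_METHODS = {
    "mean": statistics.fmean,
    "std": statistics.stdev,  # assumption: sample standard deviation
    "median": statistics.median,
    "min": min,
    "max": max,
}
CATEGORICAL_METHODS = {"nunique": lambda values: len(set(values))}


def column_stats(values, is_numeric, stats_method):
    """Compute the requested statistics for one column; inapplicable ones give NaN."""
    known = NUMERIC_METHODS.keys() | CATEGORICAL_METHODS.keys()
    applicable = NUMERIC_METHODS if is_numeric else CATEGORICAL_METHODS

    result = {}
    for name in stats_method:
        if name not in known:
            raise ValueError(f"not a single-column statistic in this sketch: {name!r}")
        result[name] = applicable[name](values) if name in applicable else math.nan
    return result


print(column_stats([1, 2, 3, 4], True, ["mean", "nunique"]))
print(column_stats(["a", "b", "a"], False, ["mean", "nunique"]))
```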

Related Documentation

  • Data Sources: Any data-producing module can be used as source, such as Loader, Splitter, Synthesizer, etc.
  • Module.key Format: Use dot notation to precisely specify when modules have multiple outputs, e.g., Splitter.train
  • Statistical Methods: Automatically determines applicable statistical methods based on data types