Schema YAML

Schema YAML is the YAML configuration format for defining the structure of a dataset: its fields, their data types, and related options.

Usage Examples

External File Reference

Loader:
  my_experiment:
    filepath: data/users.csv
    schema: schemas/user_schema.yaml  # Reference external file

Inline Definition

Loader:
  my_experiment:
    filepath: data/users.csv
    schema:                   # Inline schema definition
      id: user_data
      attributes:             # Field definitions (also can be written as fields)
        user_id:
          type: int64
          enable_null: false
        username:
          type: string
          enable_null: true

Automatic Inference

If no schema is provided, the structure is inferred automatically from the data:

Loader:
  auto_infer:
    filepath: data/auto.csv
    # No schema specified, will be inferred
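
The three forms above differ only in the value under schema: a string path, an inline mapping, or nothing at all. The following is a minimal sketch of how a loader could tell them apart. The helper classify_schema_source is hypothetical and is not part of the documented API; it only illustrates the dispatch.

```python
from typing import Any, Optional, Tuple


def classify_schema_source(schema: Optional[Any]) -> Tuple[str, Any]:
    """Classify the `schema` value of a Loader entry (hypothetical helper)."""
    if schema is None:
        return ("infer", None)      # no schema key: infer from data
    if isinstance(schema, str):
        return ("file", schema)     # string: path to an external YAML file
    if isinstance(schema, dict):
        return ("inline", schema)   # mapping: inline schema definition
    raise TypeError(f"unsupported schema value: {type(schema).__name__}")


# Loader entries as they would look after parsing the YAML above.
loader_config = {
    "my_experiment": {
        "filepath": "data/users.csv",
        "schema": "schemas/user_schema.yaml",
    },
    "auto_infer": {"filepath": "data/auto.csv"},
}

for name, entry in loader_config.items():
    mode, _ = classify_schema_source(entry.get("schema"))
    print(f"{name}: {mode}")
```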

Main Structure

id: <schema_id>           # Required: Schema identifier
attributes:               # Required: Attribute definitions (also can be written as fields)
  <attribute_name>:       # Field name as key
    type: <data_type>     # Optional: Data type (inferred if omitted; specifying it is recommended)
    enable_null: <bool>   # Optional: Allow null values (default: true)
    logical_type: <type>  # Optional: Logical type hint

Note: attributes can also be written as fields.
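
Because attributes and fields are interchangeable, code that consumes a parsed schema may want to normalize to one key first. A minimal sketch, assuming the schema has already been parsed into a dict; normalize_schema is a hypothetical helper, and rejecting a schema that uses both keys is an assumption made here, not documented behavior.

```python
def normalize_schema(raw: dict) -> dict:
    """Return a copy of the schema that always uses the `attributes` key."""
    if "id" not in raw:
        raise ValueError("schema is missing required key 'id'")
    if "attributes" in raw and "fields" in raw:
        # Assumption: treat the ambiguous case as an error.
        raise ValueError("use either 'attributes' or 'fields', not both")

    schema = dict(raw)  # shallow copy so the caller's dict is untouched
    if "fields" in schema:
        schema["attributes"] = schema.pop("fields")
    if "attributes" not in schema:
        raise ValueError("schema is missing 'attributes' (or 'fields')")
    return schema


parsed = {"id": "user_data", "fields": {"user_id": {"type": "int64"}}}
print(normalize_schema(parsed))
```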

Attribute Parameter List

Required Parameters

| Parameter | Type | Description | Example |
|---|---|---|---|
| name | string | Field name (automatically set when used as key) | "user_id", "age" |

Optional Parameters

| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
| type | string | null | Data type, auto-inferred if not specified | "int64", "string", "float64" |
| enable_null | boolean | true | Allow null values | true, false |
| category | boolean | null | Whether it’s categorical data | true, false |
| logical_type | string | null | Logical type annotation for validation | "email", "url", "phone" |
| description | string | null | Field description text | "User unique identifier" |
| type_attr | dict | null | Additional type attributes (precision, format, etc.) | {"precision": 2}, {"format": "%Y-%m-%d"} |
| na_values | list | null | Custom missing value markers | ["?", "N/A", "unknown"] |
| default_value | any | null | Default fill value | 0, "Unknown", false |
| constraints | dict | null | Field constraint conditions | {"min": 0, "max": 100} |
| enable_optimize_type | boolean | true | Enable type optimization | true, false |
| enable_stats | boolean | true | Calculate statistics | true, false |
| cast_errors | string | "coerce" | Type conversion error handling | "raise", "coerce", "ignore" |
| null_strategy | string | "keep" | Null value handling strategy | "keep", "drop", "fill" |
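
To make the defaults listed above concrete, the sketch below mirrors them in a plain Python dataclass. AttributeSpec is a hypothetical container written for illustration only; it is not the library's own class, and the validation of cast_errors and null_strategy simply checks against the example values shown in this section.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AttributeSpec:
    """Hypothetical container mirroring the documented parameter defaults."""

    name: str                                # the only required parameter
    type: Optional[str] = None               # inferred when left unset
    enable_null: bool = True
    category: Optional[bool] = None
    logical_type: Optional[str] = None
    description: Optional[str] = None
    type_attr: Optional[dict] = None
    na_values: Optional[list] = None
    default_value: Any = None
    constraints: Optional[dict] = None
    enable_optimize_type: bool = True
    enable_stats: bool = True
    cast_errors: str = "coerce"
    null_strategy: str = "keep"

    def __post_init__(self) -> None:
        # Accept only the values listed in the Example column.
        if self.cast_errors not in {"raise", "coerce", "ignore"}:
            raise ValueError(f"invalid cast_errors: {self.cast_errors!r}")
        if self.null_strategy not in {"keep", "drop", "fill"}:
            raise ValueError(f"invalid null_strategy: {self.null_strategy!r}")


age = AttributeSpec(name="age", type="int64", constraints={"min": 0, "max": 100})
print(age.enable_null, age.cast_errors, age.null_strategy)
```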

System Auto-Generated Parameters

| Parameter | Type | Description |
|---|---|---|
| stats | FieldStats | Field statistics (auto-calculated when enable_stats=True) |
| created_at | datetime | Creation timestamp (auto-recorded by system) |
| updated_at | datetime | Update timestamp (auto-recorded by system) |

Auto-Inference Mechanism:

  • When using Metadater.from_data(), parameters such as type, logical_type, and enable_null are inferred automatically from the data
  • When creating a Schema manually, only name is required; all other parameters are optional
  • Specifying type explicitly is recommended to ensure data is processed accurately
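
To illustrate what inferring type and enable_null from data can look like, here is a deliberately simplified sketch that works on raw string cells. It is not the algorithm used by Metadater.from_data(); the function infer_type, the NA_MARKERS set, and the restriction to three type names are all assumptions made for this example.

```python
from typing import Callable, Iterable, Optional

NA_MARKERS = {"", "NA", "N/A"}  # assumed missing-value markers


def _parses(text: str, caster: Callable[[str], object]) -> bool:
    """Return True if `caster` accepts `text` without raising ValueError."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def infer_type(values: Iterable[Optional[str]]) -> dict:
    """Guess `type` and `enable_null` for one column of raw string cells."""
    has_null = False
    seen_value = False
    all_int = True
    all_float = True
    for value in values:
        if value is None or value.strip() in NA_MARKERS:
            has_null = True
            continue
        seen_value = True
        text = value.strip()
        all_int = all_int and _parses(text, int)
        all_float = all_float and _parses(text, float)

    if seen_value and all_int:
        inferred = "int64"
    elif seen_value and all_float:
        inferred = "float64"
    else:
        inferred = "string"
    return {"type": inferred, "enable_null": has_null}


print(infer_type(["1", "2", "3"]))
print(infer_type(["1.5", "2", ""]))
print(infer_type(["alice", None]))
```

A guess of this kind can go wrong, for example on an ID column that happens to contain only digits, which is one reason an explicit type is the safer choice.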

Advanced Usage

Reusing Schema Across Tables

Loader:
  train_data:
    filepath: data/train.csv
    schema: schemas/common_schema.yaml
    
  test_data:
    filepath: data/test.csv
    schema: schemas/common_schema.yaml

Partial Definition

Define only key fields, others will be inferred:

schema:
  id: partial_schema
  attributes:
    primary_key:
      type: int64
      enable_null: false
    # Other fields will be inferred
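
One way to picture a partial definition is as an overlay: attributes are inferred for every column, and the explicitly declared ones then take precedence. The sketch below assumes that precedence rule and uses a hypothetical helper, merge_partial; the actual merge logic in the library may differ.

```python
def merge_partial(explicit: dict, inferred: dict) -> dict:
    """Overlay explicitly declared attributes on top of inferred ones."""
    # Copy each inferred spec so the inputs are not modified.
    merged = {name: dict(spec) for name, spec in inferred.items()}
    for name, spec in explicit.items():
        # Assumption: explicit settings win key by key.
        merged.setdefault(name, {}).update(spec)
    return merged


explicit = {"primary_key": {"type": "int64", "enable_null": False}}
inferred = {
    "primary_key": {"type": "float64", "enable_null": True},
    "username": {"type": "string", "enable_null": True},
}

merged = merge_partial(explicit, inferred)
print(merged["primary_key"])  # explicit definition is kept
print(merged["username"])     # inferred definition fills the gap
```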

Statistics

When using Metadater.from_data() with enable_stats=True, the system automatically calculates statistics.

Field Statistics Example

attributes:
  age:
    type: int64
    enable_null: true
    stats:
      row_count: 1000
      na_count: 50
      unique_count: 65
      mean: 35.5
      median: 34.0
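
The statistics in the example above can be reproduced for a small column with the Python standard library. This is only a sketch of what each number means; field_stats is a hypothetical function, and excluding nulls from unique_count, mean, and median is an assumption rather than documented behavior.

```python
import statistics
from typing import Optional, Sequence


def field_stats(values: Sequence[Optional[float]]) -> dict:
    """Compute the statistics shown above for one column."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "na_count": len(values) - len(present),
        "unique_count": len(set(present)),  # assumption: nulls not counted
        "mean": statistics.mean(present) if present else None,
        "median": statistics.median(present) if present else None,
    }


ages = [30, 34, None, 41, 34]
stats = field_stats(ages)
print(stats)
```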

Programmatic Access

from petsard.metadater import Metadater
import pandas as pd

# Create with statistics
data = {'users': pd.DataFrame({...})}
metadata = Metadater.from_data(
    data=data,
    enable_stats=True
)

# Access statistics
schema = metadata.schemas["users"]
age_attr = schema.attributes["age"]
print(f"Average age: {age_attr.stats.mean}")

Related Documentation

  • Data Types: See Data Types for details
  • Logical Types: See Logical Types for details
  • Architecture: Schema uses a three-layer architecture design, see Schema Architecture for details
  • Data Alignment: Schema can be used for data alignment and validation, see Metadater API documentation
  • Loader Integration: How Schema is used during data loading, see Loader YAML documentation
  • Reporter Output: Use Reporter’s save_schema method to export schema from each module, see Reporter - Save Schema for details

Important Notes

  • Field order does not affect data loading
  • Fields that are defined in the schema but missing from the data will be filled with default values (enable_null=true)
  • Extra fields in data will be retained
  • The system will attempt automatic type conversion for compatible types
  • attributes can also be written as fields
  • Logical types are used only for validation; they do not change the storage format
  • Calculating statistics increases processing time; use it with care on large datasets
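
The notes on field order, missing fields, and extra fields can be summarized in a small sketch that aligns a single record to a schema. The helper align_row is hypothetical, and filling a missing field from default_value (or None when no default is set) is an assumption about how "default values" are chosen.

```python
def align_row(row: dict, attributes: dict) -> dict:
    """Align one record to the schema's attributes (illustrative only)."""
    aligned = {}
    # Schema fields come first; the order of keys in `row` is irrelevant.
    for name, spec in attributes.items():
        if name in row:
            aligned[name] = row[name]
        else:
            # Missing field: fall back to default_value, else None.
            aligned[name] = spec.get("default_value")
    # Extra fields that the schema does not mention are retained.
    for name, value in row.items():
        aligned.setdefault(name, value)
    return aligned


attributes = {
    "user_id": {"type": "int64", "enable_null": False},
    "country": {"type": "string", "default_value": "Unknown"},
    "age": {"type": "int64"},
}
row = {"age": 30, "user_id": 1, "note": "extra column"}

result = align_row(row, attributes)
print(result)
```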