Schema YAML
YAML configuration format for data structure definition.
Usage Examples
External File Reference
Loader:
  my_experiment:
    filepath: data/users.csv
    schema: schemas/user_schema.yaml  # Reference external file
Inline Definition
Loader:
  my_experiment:
    filepath: data/users.csv
    schema:                  # Inline schema definition
      id: user_data
      attributes:            # Field definitions (also can be written as fields)
        user_id:
          type: int64
          enable_null: false
        username:
          type: string
          enable_null: true
Automatic Inference
If no schema is provided, the system will automatically infer structure from data:
Loader:
  auto_infer:
    filepath: data/auto.csv
    # No schema specified, will be inferred
Main Structure
id: <schema_id>              # Required: Schema identifier
attributes:                  # Required: Attribute definitions (also can be written as fields)
  <attribute_name>:          # Field name as key
    type: <data_type>        # Required: Data type
    enable_null: <bool>      # Optional: Allow null values (default: true)
    logical_type: <type>     # Optional: Logical type hint
ℹ️ attributes can also be written as fields.
Attribute Parameter List
Required Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| name | string | Field name (automatically set when used as key) | "user_id", "age" |
Optional Parameters
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
| type | string | null | Data type, auto-inferred if not specified | "int64", "string", "float64" |
| enable_null | boolean | true | Allow null values | true, false |
| category | boolean | null | Whether it’s categorical data | true, false |
| logical_type | string | null | Logical type annotation for validation | "email", "url", "phone" |
| description | string | null | Field description text | "User unique identifier" |
| type_attr | dict | null | Additional type attributes (precision, format, etc.) | {"precision": 2}, {"format": "%Y-%m-%d"} |
| na_values | list | null | Custom missing value markers | ["?", "N/A", "unknown"] |
| default_value | any | null | Default fill value | 0, "Unknown", false |
| constraints | dict | null | Field constraint conditions | {"min": 0, "max": 100} |
| enable_optimize_type | boolean | true | Enable type optimization | true, false |
| enable_stats | boolean | true | Calculate statistics | true, false |
| cast_errors | string | "coerce" | Type conversion error handling | "raise", "coerce", "ignore" |
| null_strategy | string | "keep" | Null value handling strategy | "keep", "drop", "fill" |
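
To make the interaction between na_values and cast_errors more concrete, here is a minimal standalone sketch. It is not PETsARD's implementation: the helper cast_to_int is hypothetical, and the per-value behaviour of the three cast_errors modes is an assumption based only on the option names listed in the table.

```python
def cast_to_int(values, na_values=(), cast_errors="coerce"):
    """Cast raw strings to int under three assumed cast_errors modes.

    Assumed semantics (not confirmed for PETsARD):
      "raise"  -> propagate the ValueError
      "coerce" -> replace the unparseable value with None
      "ignore" -> keep the original value unchanged
    """
    if cast_errors not in {"raise", "coerce", "ignore"}:
        raise ValueError(f"unknown cast_errors mode: {cast_errors!r}")
    markers = set(na_values)
    result = []
    for value in values:
        if value is None or value in markers:
            result.append(None)  # a custom missing-value marker becomes a null
            continue
        try:
            result.append(int(value))
        except ValueError:
            if cast_errors == "raise":
                raise
            result.append(None if cast_errors == "coerce" else value)
    return result


raw = ["42", "?", "seven", "13"]
print(cast_to_int(raw, na_values=["?", "N/A"], cast_errors="coerce"))  # [42, None, None, 13]
print(cast_to_int(raw, na_values=["?", "N/A"], cast_errors="ignore"))  # [42, None, 'seven', 13]
```

In this sketch the na_values markers are resolved before any cast is attempted, so "?" never counts as a conversion error.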
System Auto-Generated Parameters
| Parameter | Type | Description |
|---|---|---|
| stats | FieldStats | Field statistics (auto-calculated when enable_stats=True) |
| created_at | datetime | Creation timestamp (auto-recorded by system) |
| updated_at | datetime | Update timestamp (auto-recorded by system) |
ℹ️ Auto-Inference Mechanism:
- When using Metadater.from_data(), parameters like type, logical_type, and enable_null are automatically inferred from the data
- When manually creating a Schema, only name is required; all other parameters are optional
- Explicitly specifying type is recommended to ensure data processing accuracy
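
The rules above (the key doubles as name, fields is an alias for attributes, and omitted parameters fall back to their defaults) can be sketched on an already-parsed schema mapping. The helper normalize_schema is hypothetical and only illustrates the documented defaults; it is not how PETsARD resolves a schema internally.

```python
from copy import deepcopy

# Defaults copied from the Optional Parameters table in this document.
ATTRIBUTE_DEFAULTS = {
    "type": None,
    "enable_null": True,
    "category": None,
    "logical_type": None,
    "enable_optimize_type": True,
    "enable_stats": True,
    "cast_errors": "coerce",
    "null_strategy": "keep",
}


def normalize_schema(raw: dict) -> dict:
    """Return a copy of a parsed schema with the alias and defaults resolved.

    Hypothetical helper for illustration; not part of PETsARD.
    """
    if "id" not in raw:
        raise ValueError("schema is missing the required 'id' key")
    # 'fields' is accepted as an alternative spelling of 'attributes'.
    attrs = raw.get("attributes", raw.get("fields"))
    if attrs is None:
        raise ValueError("schema needs an 'attributes' (or 'fields') mapping")

    normalized = {}
    for key, params in attrs.items():
        merged = {**ATTRIBUTE_DEFAULTS, **deepcopy(params or {})}
        merged["name"] = key  # the mapping key doubles as the field name
        normalized[key] = merged
    return {"id": raw["id"], "attributes": normalized}


schema = normalize_schema({
    "id": "user_data",
    "fields": {
        "user_id": {"type": "int64", "enable_null": False},
        "username": None,  # no parameters given: type stays None (to be inferred)
    },
})
print(schema["attributes"]["user_id"]["enable_null"])   # False
print(schema["attributes"]["username"]["cast_errors"])  # coerce
```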
Advanced Usage
Reusing Schema Across Tables
Loader:
  train_data:
    filepath: data/train.csv
    schema: schemas/common_schema.yaml
  test_data:
    filepath: data/test.csv
    schema: schemas/common_schema.yaml
Partial Definition
Define only key fields, others will be inferred:
schema:
  id: partial_schema
  attributes:
    primary_key:
      type: int64
      enable_null: false
    # Other fields will be inferred
Statistics
When using Metadater.from_data() with enable_stats=True, the system automatically calculates statistics.
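
As a rough illustration of what such per-field statistics contain, the sketch below computes the same keys that appear in the Field Statistics Example in this section, using only the standard library. The function field_stats is hypothetical, and whether PETsARD counts nulls toward unique_count is not stated here, so that choice is an assumption.

```python
from statistics import mean, median


def field_stats(values):
    """Summarise one column; None stands in for a missing value."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "na_count": len(values) - len(present),
        "unique_count": len(set(present)),  # assumption: nulls are not counted as a value
        "mean": mean(present) if present else None,
        "median": median(present) if present else None,
    }


ages = [31, 34, None, 40, 34, 29]
stats = field_stats(ages)
print(stats["row_count"], stats["na_count"], stats["unique_count"])  # 6 1 4
print(stats["median"])  # 34
```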
Field Statistics Example
attributes:
  age:
    type: int64
    enable_null: true
    stats:
      row_count: 1000
      na_count: 50
      unique_count: 65
      mean: 35.5
      median: 34.0
Programmatic Access
from petsard.metadater import Metadater
import pandas as pd
# Create with statistics
data = {'users': pd.DataFrame({...})}
metadata = Metadater.from_data(
    data=data,
    enable_stats=True
)
# Access statistics
schema = metadata.schemas["users"]
age_attr = schema.attributes["age"]
print(f"Average age: {age_attr.stats.mean}")
Related Documentation
- Data Types: See Data Types for details
- Logical Types: See Logical Types for details
- Architecture: Schema uses a three-layer architecture design, see Schema Architecture for details
- Data Alignment: Schema can be used for data alignment and validation, see Metadater API documentation
- Loader Integration: How Schema is used during data loading, see Loader YAML documentation
- Reporter Output: Use Reporter’s save_schema method to export schema from each module, see Reporter - Save Schema for details
Important Notes
- Field order does not affect data loading
- Missing fields in data will be filled with default values (enable_null=true)
- Extra fields in data will be retained
- The system will attempt automatic type conversion for compatible types
- attributes can also be written as fields
- Logical types are only for validation and do not change storage format
- Statistics calculation increases processing time, use carefully with large datasets
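
The first three notes (order-insensitive loading, filling of missing fields, retention of extra fields) can be pictured with a small sketch over plain dict rows. The helper align_rows and its defaults argument are hypothetical names used for illustration; the actual alignment is performed by the library and may differ in detail.

```python
def align_rows(rows, schema_fields, defaults=None):
    """Align dict rows to a schema: fill missing fields, keep extra ones.

    Hypothetical helper; `defaults` maps a field name to its default_value.
    """
    defaults = defaults or {}
    aligned = []
    for row in rows:
        # Schema fields come first, in schema order; a missing field gets
        # its default, or None (a null) when no default is configured.
        out = {name: row.get(name, defaults.get(name)) for name in schema_fields}
        # Fields the schema does not describe are retained after the schema fields.
        out.update({k: v for k, v in row.items() if k not in out})
        aligned.append(out)
    return aligned


rows = [
    {"username": "ada", "user_id": 1},               # different key order
    {"user_id": 2, "signup_source": "newsletter"},   # missing username, extra field
]
for row in align_rows(rows, ["user_id", "username"]):
    print(row)
# {'user_id': 1, 'username': 'ada'}
# {'user_id': 2, 'username': None, 'signup_source': 'newsletter'}
```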