Metadater API
Metadater is the data structure metadata manager, providing metadata definition, inference, comparison, and alignment for datasets.
Module Overview
Metadater uses a three-tier architecture:
Configuration Classes
Define static configuration of data structures:
- Metadata: Dataset-level configuration
- Schema: Table-level configuration
- Attribute: Field-level configuration
Operation Classes
Provide class methods to operate on configuration:
- Metadater: Multi-table operations
- SchemaMetadater: Single-table operations
- AttributeMetadater: Single-field operations
Data Abstraction Classes
High-level abstractions combining data with configuration:
- Datasets: Multi-table dataset (data + Metadata)
- Table: Single table (data + Schema)
- Field: Single field (data + Attribute)
Schema Inference Tools
- SchemaInferencer: Infer Schema changes after Processor transformations
- ProcessorTransformRules: Define transformation rules
- TransformRule: Data class for a single transformation rule
Basic Usage
Through Loader (Recommended)
# Defined in YAML
Loader:
  my_experiment:
    filepath: data/users.csv
    schema: schemas/user_schema.yaml
Direct Metadater Usage
from petsard.metadater import Metadater
import pandas as pd
# Infer from data
data = {'users': pd.DataFrame(...)}
metadata = Metadater.from_data(data)
# Create from dictionary
config = {'schemas': {'users': {...}}}
metadata = Metadater.from_dict(config)
# Compare differences
diff = Metadater.diff(metadata, new_data)
# Align data
aligned = Metadater.align(metadata, new_data)
Configuration Classes
Metadata
Dataset-level configuration:
from petsard.metadater import Metadata, Schema
metadata = Metadata(
    id="my_dataset",
    schemas={'users': Schema(...)}
)
Main Properties:
- id: Dataset identifier
- schemas: Table structure dictionary {table_name: Schema}
- enable_stats: Whether to enable statistics
- stats: Dataset statistics (DatasetsStats)
Schema
Table-level configuration:
from petsard.metadater import Schema, Attribute
schema = Schema(
    id="users",
    attributes={
        'user_id': Attribute(name='user_id', type='int'),
        'email': Attribute(name='email', type='str'),
    }
)
Main Properties:
- id: Table identifier
- attributes: Field attribute dictionary {field_name: Attribute}
- primary_key: Primary key field list
- enable_stats: Whether to enable statistics
- stats: Table statistics (TableStats)
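The primary key is declared alongside the attributes; a minimal sketch, assuming primary_key is accepted as a constructor argument like the other properties:
from petsard.metadater import Schema, Attribute
# Hypothetical schema declaring 'user_id' as the primary key
schema = Schema(
    id="users",
    attributes={
        'user_id': Attribute(name='user_id', type='int'),
        'email': Attribute(name='email', type='str'),
    },
    primary_key=['user_id'],
)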
Attribute
Field-level configuration:
from petsard.metadater import Attribute
attribute = Attribute(
    name="age",
    type="int",
    type_attr={
        "nullable": True,
        "category": False,
    }
)
Main Properties:
- name: Field name
- type: Data type (int, float, str, date, datetime)
- type_attr: Type attribute dictionary
  - nullable: Whether null values are allowed
  - category: Whether it's categorical data
  - precision: Numeric precision
  - format: Datetime format
  - width: String width
- logical_type: Logical type (email, phone, url, etc.)
- enable_stats: Whether to enable statistics
- is_constant: Whether all values in the field are identical
Operation Classes
Metadater
Class methods for multi-table operations:
Creation Methods
- from_data(data, enable_stats=False): Infer and create Metadata from data
- from_dict(config): Create Metadata from a configuration dictionary
- from_metadata(metadata): Copy Metadata
Operation Methods
- diff(metadata, data): Compare differences
- align(metadata, data, strategy=None): Align data
- get(metadata, name): Get the specified Schema
- add(metadata, schema): Add a Schema
- update(metadata, schema): Update a Schema
- remove(metadata, name): Remove a Schema
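A minimal sketch of the operation methods, assuming diff() returns a difference report and that add(), update(), and remove() return new Metadata objects, consistent with the immutable design noted below:
from petsard.metadater import Metadater
import pandas as pd
data = {'users': pd.DataFrame({'user_id': [1, 2], 'age': [30, 40]})}
metadata = Metadater.from_data(data)
# Compare the existing definition against the data (return shape is an assumption)
report = Metadater.diff(metadata, data)
# Look up a table-level Schema, then write it back; a new Metadata object is returned
users_schema = Metadater.get(metadata, 'users')
metadata = Metadater.update(metadata, users_schema)
# Drop a table definition by name
metadata = Metadater.remove(metadata, 'users')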
SchemaMetadater
Class methods for single-table operations:
Creation Methods
- from_data(data, enable_stats=False, base_schema=None): Create Schema from a DataFrame
- from_dict(config): Create Schema from a configuration dictionary
- from_yaml(filepath): Load Schema from a YAML file
- from_metadata(schema): Copy Schema
Operation Methods
- diff(schema, data): Compare differences
- align(schema, data, strategy=None): Align data
- get(schema, name): Get an Attribute
- add(schema, attribute): Add an Attribute
- update(schema, attribute): Update an Attribute
- remove(schema, name): Remove an Attribute
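A sketch of the single-table workflow, assuming align() returns the aligned DataFrame; the file path and column names are illustrative:
from petsard.metadater import SchemaMetadater
import pandas as pd
df = pd.DataFrame({'user_id': [1, 2], 'email': ['a@example.com', 'b@example.com']})
# Load a declared Schema, or infer one directly from the data
schema = SchemaMetadater.from_yaml('schemas/user_schema.yaml')
inferred = SchemaMetadater.from_data(df)
# Compare the declared Schema with the actual data, then align
report = SchemaMetadater.diff(schema, df)
aligned_df = SchemaMetadater.align(schema, df)
# Field-level lookup
email_attr = SchemaMetadater.get(schema, 'email')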
AttributeMetadater
Class methods for single-field operations:
Creation Methods
- from_data(data, enable_stats=True, base_attribute=None): Create Attribute from a Series
- from_dict(config): Create Attribute from a configuration dictionary
- from_metadata(attribute): Copy Attribute
Operation Methods
- diff(attribute, data): Compare differences
- align(attribute, data, strategy=None): Align data
- validate(attribute, data): Validate data
- cast(attribute, data): Convert data type
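A sketch of the single-field workflow; the return values of validate() and cast() are assumptions (a validation result and the converted Series, respectively):
from petsard.metadater import AttributeMetadater
import pandas as pd
ages = pd.Series([30, 40, None], name='age')
# Infer the Attribute definition from the Series
attribute = AttributeMetadater.from_data(ages)
# Check the data against the definition, then coerce it to the declared type
validation = AttributeMetadater.validate(attribute, ages)
casted = AttributeMetadater.cast(attribute, ages)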
Data Abstraction Classes
Datasets
Multi-table dataset abstraction:
from petsard.metadater import Datasets
datasets = Datasets.create(
    data={'users': df},
    metadata=metadata
)
# Basic operations
table = datasets.get_table('users')
is_valid, errors = datasets.validate()
aligned_data = datasets.align()
Main Properties:
- table_count: Number of tables
- table_names: List of table names
Main Methods:
- get_table(name): Get a table
- get_tables(): Get all tables
- validate(): Validate data
- align(): Align data
- diff(): Compare differences
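diff() and the dataset-level properties are not shown in the example above; a short sketch, assuming diff() compares the stored data against the attached Metadata:
# Hypothetical continuation of the Datasets example above
report = datasets.diff()
print(datasets.table_count, datasets.table_names)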
Table
Single table abstraction:
from petsard.metadater import Table
table = Table.create(data=df, schema=schema)
# Basic operations
field = table.get_field('age')
is_valid, errors = table.validate()
Main Properties:
- row_count: Number of rows
- column_count: Number of columns
- columns: Column names
Main Methods:
- get_field(name): Get a field
- get_fields(): Get all fields
- validate(): Validate data
- align(): Align data
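A short sketch iterating over all fields, assuming get_fields() returns a dictionary of Field objects keyed by field name:
# Hypothetical continuation of the Table example above
for name, field in table.get_fields().items():
    print(name, field.dtype, field.null_count)
aligned_df = table.align()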
Field
Single field abstraction:
from petsard.metadater import Field
field = Field.create(data=series, attribute=attribute)
# Basic information
print(field.dtype, field.null_count, field.unique_count)
Main Properties:
- name: Field name
- dtype: Data type
- expected_type: Expected type
- null_count: Number of null values
- unique_count: Number of unique values
Main Methods:
- is_valid: Validation status
- get_validation_errors(): Get errors
- align(): Align data
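A sketch of the validation flow, assuming is_valid is a property and get_validation_errors() returns a list of messages:
# Hypothetical continuation of the Field example above
if not field.is_valid:
    for error in field.get_validation_errors():
        print(error)
aligned_series = field.align()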
Schema Inference Tools
SchemaInferencer
Infer Schema changes after Processor transformations:
from petsard.metadater import SchemaInferencer
inferencer = SchemaInferencer()
# Infer Preprocessor output
output_schema = inferencer.infer_preprocessor_output(
    input_schema=loader_schema,
    processor_config=preprocessor_config
)
# Infer pipeline Schema changes
pipeline_schemas = inferencer.infer_pipeline_schemas(
    loader_schema=loader_schema,
    pipeline_config=pipeline_config
)
ProcessorTransformRules
Define Processor transformation rules:
from petsard.metadater import ProcessorTransformRules
# Get transformation rule
rule = ProcessorTransformRules.get_rule('encoder_label')
# Apply rule
transformed_attr = ProcessorTransformRules.apply_rule(attribute, rule)
Type System
Basic Types
- int: Integer
- float: Float
- str: String
- date: Date
- datetime: Datetime
Logical Types
Optional semantic types:
- email, phone, url
- encoded_categorical, onehot_encoded
- standardized, normalized
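A minimal sketch attaching a logical type to a field definition, assuming logical_type is accepted as a constructor argument like the other Attribute properties:
from petsard.metadater import Attribute
# Hypothetical email field carrying a semantic (logical) type
email_attr = Attribute(
    name="email",
    type="str",
    logical_type="email",
)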
Type Attributes
type_attr contains additional type information:
- nullable: Whether null values are allowed
- category: Whether it's categorical data
- precision: Numeric precision (decimal places)
- format: Datetime format
- width: String width (leading zeros)
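A sketch of type_attr for a numeric and a datetime field; the field names, precision, and format values are illustrative:
from petsard.metadater import Attribute
# Hypothetical amount field stored with two decimal places
amount_attr = Attribute(
    name="amount",
    type="float",
    type_attr={"nullable": False, "precision": 2},
)
# Hypothetical datetime field with an explicit parsing format
created_attr = Attribute(
    name="created_at",
    type="datetime",
    type_attr={"format": "%Y-%m-%d %H:%M:%S"},
)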
Notes
- Primarily Internal Use: Intended mainly for internal PETsARD modules; general users should access it through Loader
- Class Method Design: All methods are class methods and don’t require instantiation
- Immutable Design: Configuration objects return new objects when modified
- Auto-Inference: from_data() automatically infers types, nulls, and statistics
- Statistics: Set enable_stats=True to enable detailed statistics