Metadater API (WIP)
Metadater is the data structure metadata manager. It provides metadata definition, comparison, and alignment functionality for datasets.
Class Architecture
classDiagram
    class Metadater {
        <<main>>
        +from_data(data: dict) Metadata
        +from_dict(config: dict) Metadata
        +diff(metadata: Metadata, data: dict) dict
        +align(metadata: Metadata, data: dict) dict
        +get(metadata: Metadata, name: str) Schema
        +add(metadata: Metadata, schema: Schema) Metadata
        +update(metadata: Metadata, schema: Schema) Metadata
        +remove(metadata: Metadata, name: str) Metadata
    }
    class SchemaMetadater {
        <<operation>>
        +from_data(data: DataFrame) Schema
        +from_dict(config: dict) Schema
        +diff(schema: Schema, data: DataFrame) dict
        +align(schema: Schema, data: DataFrame) DataFrame
        +get(schema: Schema, name: str) Attribute
        +add(schema: Schema, attribute: Attribute) Schema
        +update(schema: Schema, attribute: Attribute) Schema
        +remove(schema: Schema, name: str) Schema
    }
    class AttributeMetadater {
        <<operation>>
        +from_data(data: Series) Attribute
        +from_dict(config: dict) Attribute
        +diff(attribute: Attribute, data: Series) dict
        +align(attribute: Attribute, data: Series) Series
        +validate(attribute: Attribute, data: Series) tuple
    }
    class Metadata {
        <<dataclass>>
        +id: str
        +schemas: dict[str, Schema]
    }
    class Schema {
        <<dataclass>>
        +id: str
        +attributes: dict[str, Attribute]
    }
    class Attribute {
        <<dataclass>>
        +name: str
        +type: str
        +nullable: bool
        +logical_type: str
    }

    %% Operation relationships
    Metadater ..> Metadata : creates/operates
    SchemaMetadater ..> Schema : creates/operates
    AttributeMetadater ..> Attribute : creates/operates

    %% Composition relationships
    Metadata *-- Schema : contains
    Schema *-- Attribute : contains

    %% Hierarchical calls
    Metadater --> SchemaMetadater : calls
    SchemaMetadater --> AttributeMetadater : calls

    %% Style markers
    style Metadater fill:#e6f3ff,stroke:#4a90e2,stroke-width:3px
    style SchemaMetadater fill:#fff2e6,stroke:#ff9800,stroke-width:2px
    style AttributeMetadater fill:#fff2e6,stroke:#ff9800,stroke-width:2px
    style Metadata fill:#f0f8ff,stroke:#6495ed,stroke-width:2px
    style Schema fill:#f0f8ff,stroke:#6495ed,stroke-width:2px
    style Attribute fill:#f0f8ff,stroke:#6495ed,stroke-width:2px
Legend:
- Blue boxes: Main operation classes
- Orange boxes: Operation subclasses
- Light blue boxes: Data configuration classes
- ..>: Create/operate relationship
- *--: Composition relationship
- -->: Call relationship
Basic Usage
Metadater is primarily used as an internal component, typically accessed through Loader’s schema parameter:
# Defined in YAML
Loader:
  my_experiment:
    filepath: data/users.csv
    schema: schemas/user_schema.yaml
For direct use of Metadater class methods:
from petsard.metadater import Metadater
import pandas as pd

# Automatically infer structure from data
data = {'users': pd.DataFrame(...)}
metadata = Metadater.from_data(data)

# Create metadata from dictionary
config = {'tables': {...}}
metadata = Metadater.from_dict(config)

# Compare data differences
# (new_data is a dict of table name to DataFrame, shaped like `data` above)
diff = Metadater.diff(metadata, new_data)

# Align data structure
aligned = Metadater.align(metadata, new_data)
Class Method Description
Metadater exposes its operations as class-level methods (@classmethod or @staticmethod) that can be used without instantiation:
Creating Metadata
- from_data(): Automatically infer and create Metadata from data
- from_dict(): Create Metadata from a configuration dictionary
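As a rough illustration of what inferring structure from data can involve, the sketch below derives a type name and a nullability flag from a column of plain Python values. It is a standalone example under simplifying assumptions, not Metadater's implementation: the helper name infer_attribute, the fallback to str, and the use of None as the only null marker are all hypothetical.

```python
def infer_attribute(name: str, values: list) -> dict:
    """Infer a type name and nullability from a column of Python values."""
    present = [v for v in values if v is not None]
    kinds = {type(v) for v in present}
    if kinds == {bool}:
        type_name = "bool"
    elif kinds == {int}:
        type_name = "int"
    elif kinds and kinds <= {int, float}:
        type_name = "float"  # mixed int/float columns widen to float
    else:
        type_name = "str"  # assumed fallback for mixed or empty columns
    # A column is nullable if at least one value was None.
    return {"name": name, "type": type_name, "nullable": len(present) < len(values)}


# Hypothetical columns for a 'users' table.
columns = {
    "user_id": [1, 2, 3],
    "score": [0.5, None, 2],
    "email": ["a@example.com", None, "c@example.com"],
}
inferred = {name: infer_attribute(name, vals) for name, vals in columns.items()}
```

In this sketch user_id comes out as a non-nullable int, while score and email are nullable because each contains a None.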
Comparison and Alignment
- diff(): Compare differences between Metadata and actual data
- align(): Align data structure according to Metadata
Data Structure
Metadata
Top level, manages entire dataset:
- id: Dataset identifier
- name: Dataset name (optional)
- description: Dataset description (optional)
- schemas: Table structure dictionary {table_name: Schema}
Schema
Middle level, describes single table:
- id: Table identifier
- name: Table name (optional)
- description: Table description (optional)
- attributes: Field attribute dictionary {field_name: Attribute}
Attribute
Bottom level, defines single field:
- name: Field name
- type: Data type (int, float, str, bool, datetime, etc.)
- nullable: Whether null values are allowed (True/False)
- logical_type: Logical type (optional, e.g., email, phone, url)
- na_values: Custom null value representations (optional)
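To make the three-level hierarchy concrete, here is a minimal sketch using standard-library dataclasses. It mirrors the field names listed above but is not PETsARD's actual class definitions; the default values, the Optional typing, and the example table and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attribute:
    """Bottom level: a single field (column)."""
    name: str
    type: str = "str"
    nullable: bool = True
    logical_type: Optional[str] = None
    na_values: Optional[List[str]] = None


@dataclass
class Schema:
    """Middle level: a single table, keyed by field name."""
    id: str
    name: Optional[str] = None
    description: Optional[str] = None
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class Metadata:
    """Top level: the entire dataset, keyed by table name."""
    id: str
    name: Optional[str] = None
    description: Optional[str] = None
    schemas: Dict[str, Schema] = field(default_factory=dict)


# Describe a one-table dataset (hypothetical names).
users = Schema(
    id="users",
    attributes={
        "user_id": Attribute(name="user_id", type="int", nullable=False),
        "email": Attribute(name="email", type="str", logical_type="email"),
    },
)
metadata = Metadata(id="demo", schemas={"users": users})
```

Navigating from the top level down follows the two dictionaries: metadata.schemas["users"].attributes["email"] reaches a single field definition.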
Use Cases
1. Schema Management During Data Loading
Loader uses Metadater internally to handle the schema:
# Loader internal process (simplified)
schema = Metadater.from_dict(schema_config) # Load from YAML
data = pd.read_csv(filepath) # Read data
aligned_data = Metadater.align(schema, data) # Align data structure
2. Data Structure Validation
Compare expected structure with actual data:
# Define expected schema
expected_schema = Metadater.from_dict(config)
# Compare actual data
diff = Metadater.diff(expected_schema, {'users': actual_data})
if diff:
    print("Structure differences found:", diff)
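The exact shape of the dictionary returned by diff() is not specified here. As a rough illustration of the kind of check involved, the following standalone sketch compares an expected field-to-type mapping against the fields actually present. The function diff_fields and its report keys are hypothetical and not part of Metadater.

```python
def diff_fields(expected: dict, actual: dict) -> dict:
    """Compare expected {field: type} against actual {field: type}.

    Returns an empty dict when both agree, so `if result:` works as a check.
    """
    report = {}
    missing = [f for f in expected if f not in actual]
    extra = [f for f in actual if f not in expected]
    mismatched = {
        f: {"expected": expected[f], "actual": actual[f]}
        for f in expected
        if f in actual and expected[f] != actual[f]
    }
    # Only include categories that actually contain differences.
    if missing:
        report["missing_fields"] = missing
    if extra:
        report["extra_fields"] = extra
    if mismatched:
        report["type_mismatches"] = mismatched
    return report


# Hypothetical expected and observed structures for a 'users' table.
expected = {"user_id": "int", "email": "str", "age": "int"}
actual = {"user_id": "int", "email": "str", "age": "float", "nickname": "str"}
result = diff_fields(expected, actual)
```

Returning an empty dictionary for matching structures is a design choice that keeps the truthiness check used above meaningful.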
3. Unifying Multiple Dataset Structures
Ensure multiple datasets have the same structure:
# Define standard structure
standard_schema = Metadater.from_data({'users': reference_data})
# Align other datasets
aligned_data1 = Metadater.align(standard_schema, {'users': data1})
aligned_data2 = Metadater.align(standard_schema, {'users': data2})
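The Notes describe align() as adjusting field order, supplementing missing fields, and converting data types. The sketch below shows those three steps on plain Python rows rather than DataFrames. It is an assumption-laden illustration, not Metadater's code: align_rows and CASTERS are hypothetical names, missing fields are filled with None, and fields absent from the schema are dropped.

```python
# Map type names to Python constructors (a deliberately small, assumed set).
CASTERS = {"int": int, "float": float, "str": str}


def align_rows(schema: dict, rows: list) -> list:
    """Align each row to `schema`, an ordered {field: type name} mapping."""
    aligned = []
    for row in rows:
        new_row = {}
        for field_name, type_name in schema.items():  # schema order wins
            value = row.get(field_name)  # missing fields become None
            if value is not None:
                value = CASTERS[type_name](value)  # convert to the schema type
            new_row[field_name] = value
        aligned.append(new_row)
    return aligned


# Hypothetical schema and input with wrong order, a missing field, and a string id.
schema = {"user_id": "int", "age": "int", "email": "str"}
rows = [{"email": "a@example.com", "user_id": "1"}]
aligned = align_rows(schema, rows)
```

After alignment the single row has its keys in schema order, user_id converted to an integer, and age present with a None value.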
Notes
- Primarily Internal Use: Metadater is mainly for internal use by PETsARD modules; general users can access it through Loader’s schema parameter
- Class Method Design: All methods are class methods and don’t require instantiating Metadater
- Auto-Inference: from_data() automatically infers field types and nullability
- Alignment Behavior: align() adjusts field order, supplements missing fields, and converts data types according to the schema
- Difference Detection: diff() detects differences in field names, types, null value handling, etc.
- YAML Configuration: For detailed Schema YAML configuration, see Schema YAML Documentation
- Documentation Notice: This documentation is for internal development team reference only and does not guarantee backward compatibility