Metadater API (WIP)

Metadater manages the metadata that describes a dataset's structure. It provides metadata definition, comparison, and alignment functionality for datasets.

Class Architecture

classDiagram
    class Metadater {
        <<main>>
        +from_data(data: dict) Metadata
        +from_dict(config: dict) Metadata
        +diff(metadata: Metadata, data: dict) dict
        +align(metadata: Metadata, data: dict) dict
        +get(metadata: Metadata, name: str) Schema
        +add(metadata: Metadata, schema: Schema) Metadata
        +update(metadata: Metadata, schema: Schema) Metadata
        +remove(metadata: Metadata, name: str) Metadata
    }

    class SchemaMetadater {
        <<operation>>
        +from_data(data: DataFrame) Schema
        +from_dict(config: dict) Schema
        +diff(schema: Schema, data: DataFrame) dict
        +align(schema: Schema, data: DataFrame) DataFrame
        +get(schema: Schema, name: str) Attribute
        +add(schema: Schema, attribute: Attribute) Schema
        +update(schema: Schema, attribute: Attribute) Schema
        +remove(schema: Schema, name: str) Schema
    }

    class AttributeMetadater {
        <<operation>>
        +from_data(data: Series) Attribute
        +from_dict(config: dict) Attribute
        +diff(attribute: Attribute, data: Series) dict
        +align(attribute: Attribute, data: Series) Series
        +validate(attribute: Attribute, data: Series) tuple
    }

    class Metadata {
        <<dataclass>>
        +id: str
        +schemas: dict[str, Schema]
    }

    class Schema {
        <<dataclass>>
        +id: str
        +attributes: dict[str, Attribute]
    }

    class Attribute {
        <<dataclass>>
        +name: str
        +type: str
        +nullable: bool
        +logical_type: str
    }

    %% Operation relationships
    Metadater ..> Metadata : creates/operates
    SchemaMetadater ..> Schema : creates/operates
    AttributeMetadater ..> Attribute : creates/operates
    
    %% Composition relationships
    Metadata *-- Schema : contains
    Schema *-- Attribute : contains
    
    %% Hierarchical calls
    Metadater --> SchemaMetadater : calls
    SchemaMetadater --> AttributeMetadater : calls

    %% Style markers
    style Metadater fill:#e6f3ff,stroke:#4a90e2,stroke-width:3px
    style SchemaMetadater fill:#fff2e6,stroke:#ff9800,stroke-width:2px
    style AttributeMetadater fill:#fff2e6,stroke:#ff9800,stroke-width:2px
    style Metadata fill:#f0f8ff,stroke:#6495ed,stroke-width:2px
    style Schema fill:#f0f8ff,stroke:#6495ed,stroke-width:2px
    style Attribute fill:#f0f8ff,stroke:#6495ed,stroke-width:2px

Legend:

  • Blue boxes: Main operation classes
  • Orange boxes: Lower-level operation classes (called by Metadater, not subclasses of it)
  • Light blue boxes: Data configuration classes
  • ..>: Create/operate relationship
  • *--: Composition relationship
  • -->: Call relationship

Basic Usage

Metadater is primarily used as an internal component, typically accessed through Loader’s schema parameter:

# Defined in YAML
Loader:
  my_experiment:
    filepath: data/users.csv
    schema: schemas/user_schema.yaml

For direct use of Metadater class methods:

from petsard.metadater import Metadater
import pandas as pd

# Automatically infer structure from data
data = {'users': pd.DataFrame(...)}
metadata = Metadater.from_data(data)

# Create metadata from dictionary
config = {'tables': {...}}
metadata = Metadater.from_dict(config)

# Compare data differences
# (new_data is a dict of {table_name: DataFrame}, shaped like data above)
diff = Metadater.diff(metadata, new_data)

# Align data structure
aligned = Metadater.align(metadata, new_data)

Class Method Description

Metadater exposes its functionality as class-level methods (@classmethod or @staticmethod), so they can be called without creating an instance:

Creating Metadata

  • from_data(): Automatically infer and create Metadata from data
  • from_dict(): Create Metadata from configuration dictionary
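
To make the idea of inference concrete, here is a minimal sketch of how field types and nullability could be guessed from raw values, with each level delegating to the one below it. It is not PETsARD's implementation: ToyAttribute, infer_attribute, infer_schema, and infer_metadata are hypothetical names, tables are plain dicts of Python lists rather than DataFrames, and the fallback to "str" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ToyAttribute:
    """Illustrative stand-in for Attribute; not the PETsARD class."""
    name: str
    type: str
    nullable: bool


def infer_attribute(name, values):
    """Guess a type name and nullability from a list of Python values."""
    present = [v for v in values if v is not None]
    nullable = len(present) < len(values)
    # Exact type matching, so that True/False are never counted as int.
    candidates = [("bool", bool), ("int", int), ("float", float),
                  ("datetime", datetime), ("str", str)]
    type_name = "str"  # fallback for empty or mixed columns (an assumption)
    for label, py_type in candidates:
        if present and all(type(v) is py_type for v in present):
            type_name = label
            break
    return ToyAttribute(name=name, type=type_name, nullable=nullable)


def infer_schema(table):
    """Table-level step: delegate every column to infer_attribute."""
    return {name: infer_attribute(name, values) for name, values in table.items()}


def infer_metadata(data):
    """Dataset-level step: delegate every table to infer_schema."""
    return {table_name: infer_schema(table) for table_name, table in data.items()}


data = {
    "users": {
        "id": [1, 2, 3],
        "email": ["a@example.com", None, "c@example.com"],
        "score": [0.5, 1.5, 2.0],
    }
}
inferred = infer_metadata(data)
```

The three functions mirror the Metadater, SchemaMetadater, and AttributeMetadater layering in the class diagram; the actual inference rules used by from_data() may differ.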

Comparison and Alignment

  • diff(): Compare differences between Metadata and actual data
  • align(): Align data structure according to Metadata
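
The following sketch illustrates the kind of work diff() and align() perform on a single table: detecting missing, extra, and mistyped fields, then reordering, filling, and converting. It is a toy under stated assumptions, not the library's code: toy_diff and toy_align are hypothetical helpers, the schema is a plain {column: python_type} dict, tables are dicts of lists, and dropping columns that are not in the schema is an assumption that the documentation does not confirm.

```python
def toy_diff(schema, table):
    """Report differences between expected columns and an actual table.

    schema: {column_name: python_type}; table: {column_name: list_of_values}.
    Returns an empty dict when nothing differs.
    """
    report = {}
    missing = [c for c in schema if c not in table]
    extra = [c for c in table if c not in schema]
    mismatched = {
        c: expected.__name__
        for c, expected in schema.items()
        if c in table
        and any(v is not None and type(v) is not expected for v in table[c])
    }
    if missing:
        report["missing"] = missing
    if extra:
        report["extra"] = extra
    if mismatched:
        report["type_mismatch"] = mismatched
    return report


def toy_align(schema, table):
    """Return a new table in schema order, with missing columns filled
    with None and present values converted to the expected type.
    Columns absent from the schema are dropped (an assumption)."""
    n_rows = max((len(col) for col in table.values()), default=0)
    aligned = {}
    for column, expected in schema.items():
        values = table.get(column, [None] * n_rows)
        aligned[column] = [None if v is None else expected(v) for v in values]
    return aligned


schema = {"id": int, "name": str, "age": int}
table = {"name": ["Ann", "Bob"], "id": ["1", "2"], "nickname": ["a", "b"]}

report = toy_diff(schema, table)     # differences before alignment
aligned = toy_align(schema, table)   # table reshaped to match the schema
```

After alignment, running toy_diff again on the aligned table returns an empty dict, which is the property the "if diff:" check in the validation use case relies on.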

Data Structure

Metadata

Top level; describes the entire dataset:

  • id: Dataset identifier
  • name: Dataset name (optional)
  • description: Dataset description (optional)
  • schemas: Table structure dictionary {table_name: Schema}

Schema

Middle level; describes a single table:

  • id: Table identifier
  • name: Table name (optional)
  • description: Table description (optional)
  • attributes: Field attribute dictionary {field_name: Attribute}

Attribute

Bottom level; defines a single field:

  • name: Field name
  • type: Data type (int, float, str, bool, datetime, etc.)
  • nullable: Whether null values are allowed (True/False)
  • logical_type: Logical type (optional, e.g., email, phone, url, etc.)
  • na_values: Custom null value representations (optional)
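
The three levels nest as plain containers. The sketch below models them with Python dataclasses using the fields listed above. The class names carry a Sketch suffix to make clear that they are illustrative stand-ins; the default values and exact type annotations are assumptions, and the real PETsARD classes may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AttributeSketch:
    """Bottom level: one field."""
    name: str
    type: str
    nullable: bool = True                 # default is an assumption
    logical_type: Optional[str] = None    # e.g. "email", "phone", "url"
    na_values: Optional[list] = None      # custom null representations


@dataclass
class SchemaSketch:
    """Middle level: one table, keyed by field name."""
    id: str
    name: Optional[str] = None
    description: Optional[str] = None
    attributes: dict = field(default_factory=dict)


@dataclass
class MetadataSketch:
    """Top level: the whole dataset, keyed by table name."""
    id: str
    name: Optional[str] = None
    description: Optional[str] = None
    schemas: dict = field(default_factory=dict)


email = AttributeSketch(name="email", type="str", nullable=False,
                        logical_type="email")
users = SchemaSketch(id="users", attributes={"email": email})
metadata = MetadataSketch(id="demo", schemas={"users": users})

# Navigate from dataset to table to field.
email_logical_type = metadata.schemas["users"].attributes["email"].logical_type
```

Keying schemas and attributes by name is what lets a lookup such as get(metadata, name) in the class diagram resolve a table or field directly.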

Use Cases

1. Schema Management During Data Loading

Loader uses Metadater internally to handle the schema:

# Loader internal process (simplified)
# Single-table flow: per the class diagram, SchemaMetadater works on one DataFrame
schema = SchemaMetadater.from_dict(schema_config)    # Load from YAML
data = pd.read_csv(filepath)                         # Read data
aligned_data = SchemaMetadater.align(schema, data)   # Align data structure

2. Data Structure Validation

Compare the expected structure with the actual data:

# Define expected schema
expected_schema = Metadater.from_dict(config)

# Compare actual data
diff = Metadater.diff(expected_schema, {'users': actual_data})

if diff:
    print("Structure differences found:", diff)

3. Unifying Multiple Dataset Structures

Ensure multiple datasets have the same structure:

# Define standard structure
standard_schema = Metadater.from_data({'users': reference_data})

# Align other datasets
aligned_data1 = Metadater.align(standard_schema, {'users': data1})
aligned_data2 = Metadater.align(standard_schema, {'users': data2})

Notes

  • Primarily Internal Use: Metadater is mainly for internal PETsARD module use; general users can access it through Loader’s schema parameter
  • Class Method Design: All methods are class methods and don’t require Metadater instantiation
  • Auto-Inference: from_data() automatically infers field types and nullability
  • Alignment Behavior: align() adjusts field order, supplements missing fields, and converts data types according to schema
  • Difference Detection: diff() detects differences in field names, types, null value handling, etc.
  • YAML Configuration: For detailed Schema YAML configuration, see Schema YAML Documentation
  • Documentation Notice: This documentation is for internal development team reference only and does not guarantee backward compatibility