Status Tracking

Status tracks the execution state of the entire workflow. It provides the complete execution history, result storage, and Schema metadata management.

Overview

Status is a state management system used internally by Executor and does not require configuration in YAML. It automatically:

Records execution results of each module
Tracks changes in metadata (Schema)
Creates execution snapshots for recovery
Collects execution time information

Automatic Status Tracking

When Executor runs, Status automatically records:

1. Execution Results

After each module executes, results are automatically saved to Status:

from petsard import Executor

exec = Executor(config='config.yaml')
exec.run()

# Get results through Executor
results = exec.get_result()
print(results)

2. Metadata Tracking

Status tracks Schema changes of data across modules:

# config.yaml
Loader:
  load_data:
    filepath: data.csv
    
Preprocessor:
  preprocess:
    method: default
    
Synthesizer:
  generate:
    method: sdv

During execution, Status records:

Original Schema after Loader loading
Schema changes after Preprocessor processing
Schema information available to Synthesizer

3. Execution Snapshots

Snapshots are created before and after each module execution, containing:

Execution timestamp
Module name and experiment name
Metadata state (before/after execution)
Execution context information

4. Time Recording

Status automatically collects the execution time of each module and step:

from petsard import Executor

exec = Executor(config='config.yaml')
exec.run()

# Get time report
timing = exec.get_timing()
print(timing)

Accessing Status Through Executor

All Status information can be accessed through Executor methods:

Get Execution Results

# Get all results
results = exec.get_result()

# Result format
# {
#   'Loader[load_data]_Synthesizer[generate]': {
#     'data': DataFrame,
#     'schema': Schema
#   }
# }
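The result keys shown above follow a `Module[experiment]` pattern joined by underscores. As a minimal sketch, assuming that layout holds for every key, the helper below splits a key back into its module and experiment names. `parse_result_key` is a hypothetical helper written for illustration and is not part of the PETsARD API.

```python
import re

# Assumed key layout, taken from the result format shown above:
# "Module[experiment]" segments joined by underscores.
_SEGMENT = re.compile(r"([A-Za-z]+)\[([^\]]+)\]")


def parse_result_key(key: str) -> list[tuple[str, str]]:
    """Split a result key into (module_name, experiment_name) pairs."""
    return _SEGMENT.findall(key)


pairs = parse_result_key("Loader[load_data]_Synthesizer[generate]")
for module_name, experiment_name in pairs:
    print(module_name, experiment_name)
```

Matching on the brackets rather than splitting on underscores keeps experiment names such as `load_data` intact.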

Get Execution Time

# Get time information
timing_df = exec.get_timing()

# DataFrame columns:
# - record_id: Record ID
# - module_name: Module name
# - experiment_name: Experiment name
# - step_name: Execution step
# - start_time: Start time
# - end_time: End time
# - duration_seconds: Execution time (seconds)
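One common use of the timing columns is to total the time spent per module. The sketch below works on plain dictionaries that use the column names listed above. The rows and the step names (`run`, `fit`, `sample`) are made-up illustration data, and `total_seconds_by_module` is a hypothetical helper. With a real timing DataFrame, equivalent rows could be obtained with the pandas call `timing_df.to_dict("records")`.

```python
from collections import defaultdict

# Hypothetical timing rows; values are invented for illustration only.
records = [
    {"record_id": 1, "module_name": "Loader", "experiment_name": "load_data",
     "step_name": "run", "duration_seconds": 2.0},
    {"record_id": 2, "module_name": "Synthesizer", "experiment_name": "generate",
     "step_name": "fit", "duration_seconds": 30.5},
    {"record_id": 3, "module_name": "Synthesizer", "experiment_name": "generate",
     "step_name": "sample", "duration_seconds": 4.5},
]


def total_seconds_by_module(rows):
    """Sum duration_seconds over all steps of each module."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["module_name"]] += row["duration_seconds"]
    return dict(totals)


totals = total_seconds_by_module(records)
slowest = max(totals, key=totals.get)  # module with the largest total time
print(totals, slowest)
```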

Check Execution Status

# Check if execution completed
if exec.is_execution_completed():
    print("Execution completed")
    results = exec.get_result()
else:
    print("Execution in progress or not started")

Multi-Experiment Result Management

When the configuration contains multiple experiments, Status manages the results for all combinations:

Configuration Example

Loader:
  load_v1:
    filepath: data_v1.csv
  load_v2:
    filepath: data_v2.csv

Synthesizer:
  method_a:
    method: sdv
    model: GaussianCopula
  method_b:
    method: sdv
    model: CTGAN

Reporter:
  save_all:
    method: save_data
    source: Synthesizer

Result Organization

results = exec.get_result()

# Results contain all experiment combinations:
# {
#   'Loader[load_v1]_Synthesizer[method_a]_Reporter[save_all]': {...},
#   'Loader[load_v1]_Synthesizer[method_b]_Reporter[save_all]': {...},
#   'Loader[load_v2]_Synthesizer[method_a]_Reporter[save_all]': {...},
#   'Loader[load_v2]_Synthesizer[method_b]_Reporter[save_all]': {...}
# }
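The four keys above are the Cartesian product of the experiments defined for each module. As a rough sketch, assuming keys are built in module order with the `Module[experiment]` pattern shown above, the expected key set can be enumerated before a run. `expected_result_keys` is a hypothetical helper, not a PETsARD function.

```python
from itertools import product

# Experiment names per module, mirroring the configuration example above.
experiments = {
    "Loader": ["load_v1", "load_v2"],
    "Synthesizer": ["method_a", "method_b"],
    "Reporter": ["save_all"],
}


def expected_result_keys(experiments: dict[str, list[str]]) -> list[str]:
    """Build one key per combination of experiments, in module order."""
    modules = list(experiments)
    keys = []
    for combo in product(*(experiments[m] for m in modules)):
        keys.append("_".join(f"{m}[{name}]" for m, name in zip(modules, combo)))
    return keys


keys = expected_result_keys(experiments)
print(len(keys))  # 2 loaders x 2 synthesizers x 1 reporter = 4
```

Comparing this list with the keys of `exec.get_result()` is one way to spot a combination that did not produce a result.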

Schema Inference and Tracking

Status supports Schema inference, which is most relevant when a Preprocessor is used:

Automatic Schema Inference

Loader:
  load_data:
    filepath: data.csv

Preprocessor:
  preprocess:
    scaling:
      age: minmax
      income: standard
    encoding:
      education: onehot

Synthesizer:
  generate:
    method: sdv

Execution flow:

After the Loader loads the data, Status records the original Schema
Executor infers the processed Schema from the Preprocessor configuration
The Synthesizer uses the inferred Schema for synthesis
All Schema changes throughout the process are tracked and recorded
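The inference step in this flow can be pictured as applying the Preprocessor configuration to a description of the original Schema, without touching the data itself. The sketch below is an illustration only: it uses plain dictionaries instead of PETsARD's Schema class, and both `infer_processed_schema` and the `transforms` field are hypothetical names chosen for the example.

```python
from copy import deepcopy


def infer_processed_schema(schema: dict, preprocessor_config: dict) -> dict:
    """Return a copy of the schema annotated with the configured transforms."""
    inferred = deepcopy(schema)  # leave the original schema untouched
    for step, fields in preprocessor_config.items():
        for field_name, method in fields.items():
            if field_name in inferred:
                inferred[field_name].setdefault("transforms", []).append(
                    f"{step}:{method}"
                )
    return inferred


# Field types follow the configuration above; the dict layout is assumed.
original = {
    "age": {"type": "numerical"},
    "income": {"type": "numerical"},
    "education": {"type": "categorical"},
}
config = {
    "scaling": {"age": "minmax", "income": "standard"},
    "encoding": {"education": "onehot"},
}

processed = infer_processed_schema(original, config)
print(processed["age"])
```

Working on a copy keeps the original Schema available for comparison, which is what makes change tracking between modules possible.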

Execution Snapshots

Status creates multiple snapshots during execution:

Snapshot Contents

Each snapshot contains:

Snapshot ID: Unique identifier
Module Information: Module name and experiment name
Timestamp: Creation time
Metadata: Schema state before and after execution
Execution Context: Related configuration and parameters
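To make the snapshot contents concrete, the following sketch models them as an immutable record. This is not PETsARD's internal class: the `Snapshot` dataclass, its field names, the `take_snapshot` helper, and the ID format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import Optional

_snapshot_counter = count(1)  # simple source of unique, increasing IDs


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    module_name: str
    experiment_name: str
    timestamp: datetime
    metadata_before: Optional[dict]
    metadata_after: Optional[dict]
    context: dict = field(default_factory=dict)


def take_snapshot(module_name, experiment_name, before, after, context=None):
    """Create a snapshot with a unique ID and a UTC timestamp."""
    return Snapshot(
        snapshot_id=f"snapshot_{next(_snapshot_counter):06d}",
        module_name=module_name,
        experiment_name=experiment_name,
        timestamp=datetime.now(timezone.utc),
        metadata_before=before,
        metadata_after=after,
        context=context or {},
    )


snap = take_snapshot("Loader", "load_data", None, {"age": "numerical"})
print(snap.snapshot_id, snap.module_name, snap.experiment_name)
```

Freezing the dataclass reflects the idea that a snapshot is a historical record and should not be reassigned after creation.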

Snapshot Purposes

Debugging: Check state changes during execution
Auditing: Track complete history of data processing
Recovery: Restore to specific state when needed

Change Tracking

Status records all metadata changes:

Tracking Contents

Change Type: Create, update, delete
Change Target: Schema or Field level
Before and After: State comparison
Change Time: Occurrence timestamp
Module Context: Which module caused the change

Change Example

Loader → Preprocessor:
- age: numerical → numerical (minmax scaled)
- education: categorical → categorical (onehot encoded)
- income: numerical → numerical (standard scaled)
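A change record of this kind can be derived by comparing the field descriptions before and after a module runs. The sketch below is a simplified illustration that treats a Schema as a dictionary of field name to description. `diff_schemas` and the record keys are hypothetical, and the `notes` and `age_group` fields are invented to show the delete and create cases.

```python
from datetime import datetime, timezone


def diff_schemas(before: dict, after: dict, module_name: str) -> list[dict]:
    """Return one change record per field that was created, updated, or deleted."""
    changes = []
    for name in sorted(before.keys() | after.keys()):
        if name not in before:
            change_type = "create"
        elif name not in after:
            change_type = "delete"
        elif before[name] != after[name]:
            change_type = "update"
        else:
            continue  # unchanged field, nothing to record
        changes.append({
            "change_type": change_type,
            "target": name,
            "before": before.get(name),
            "after": after.get(name),
            "module": module_name,
            "timestamp": datetime.now(timezone.utc),
        })
    return changes


before = {"age": "numerical", "education": "categorical", "notes": "string"}
after = {
    "age": "numerical (minmax scaled)",
    "education": "categorical (onehot encoded)",
    "age_group": "categorical",
}

changes = diff_schemas(before, after, module_name="Preprocessor")
for change in changes:
    print(change["change_type"], change["target"])
```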

Status Summary

Get a complete summary of the execution status:

# Direct access through Python API (advanced usage)
from petsard import Executor

exec = Executor(config='config.yaml')
exec.run()

# Get status summary
summary = exec.status.get_status_summary()

# Summary includes:
# - sequence: Module execution sequence
# - active_modules: Executed modules
# - metadata_modules: Modules with metadata
# - total_snapshots: Total snapshot count
# - total_changes: Total change record count
# - last_snapshot: Latest snapshot ID
# - last_change: Latest change ID

Notes

Automatic Management: Status is fully managed automatically by Executor; no YAML configuration needed
Result Access: Use exec.get_result() and exec.get_timing() to get status information
Memory Usage: Long-running workflows accumulate more snapshots; Status automatically manages memory
Snapshot Count: Snapshots are created for every module execution, so configurations with many experiment combinations produce a correspondingly large number of snapshots
Advanced Features: For complete Status API, refer to Python API documentation
