Status Tracking
Status is responsible for tracking the execution status of the entire workflow, providing complete execution history, result storage, and Schema metadata management.
Overview
Status is a state management system used internally by Executor and does not require configuration in YAML. It automatically:
- Records execution results of each module
- Tracks changes in metadata (Schema)
- Creates execution snapshots for recovery
- Collects execution time information
Automatic Status Tracking
When Executor runs, Status automatically records:
1. Execution Results
After each module executes, results are automatically saved to Status:
from petsard import Executor
exec = Executor(config='config.yaml')
exec.run()
# Get results through Executor
results = exec.get_result()
print(results)
2. Metadata Tracking
Status tracks Schema changes of data across modules:
# config.yaml
Loader:
  load_data:
    filepath: data.csv
Preprocessor:
  preprocess:
    method: default
Synthesizer:
  generate:
    method: sdv
During execution, Status records:
- Original Schema after Loader loading
- Schema changes after Preprocessor processing
- Schema information available to Synthesizer
3. Execution Snapshots
Snapshots are created before and after each module execution, containing:
- Execution timestamp
- Module name and experiment name
- Metadata state (before/after execution)
- Execution context information
4. Time Recording
Automatically collects execution time for each module and step:
from petsard import Executor
exec = Executor(config='config.yaml')
exec.run()
# Get time report
timing = exec.get_timing()
print(timing)
Accessing Status Through Executor
All Status information can be accessed through Executor methods:
Get Execution Results
# Get all results
results = exec.get_result()
# Result format
# {
#     'Loader[load_data]_Synthesizer[generate]': {
#         'data': DataFrame,
#         'schema': Schema
#     }
# }
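When working with the result dictionary it can help to split each key back into its module and experiment names. The following is a minimal sketch, assuming keys follow the `Module[experiment]` layout shown in the result format above; `parse_result_key` is a hypothetical helper and not part of the petsard API.

```python
import re

# Assumed key layout, taken from the result format above:
# "Module[experiment]" segments joined by underscores.
SEGMENT = re.compile(r"([A-Za-z]+)\[([^\]]+)\]")


def parse_result_key(key: str) -> dict:
    """Split a result key into a {module_name: experiment_name} mapping."""
    return {module: experiment for module, experiment in SEGMENT.findall(key)}


parsed = parse_result_key("Loader[load_data]_Synthesizer[generate]")
print(parsed)  # {'Loader': 'load_data', 'Synthesizer': 'generate'}
```

The regular expression matches each bracketed segment on its own, so experiment names that contain underscores (such as `load_data`) are not split by mistake.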
Get Execution Time
# Get time information
timing_df = exec.get_timing()
# DataFrame columns:
# - record_id: Record ID
# - module_name: Module name
# - experiment_name: Experiment name
# - step_name: Execution step
# - start_time: Start time
# - end_time: End time
# - duration_seconds: Execution time (seconds)
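To see where time is spent, the timing records can be summed per module. The sketch below works on plain dictionaries that use the column names listed above, for example rows obtained with pandas' `timing_df.to_dict("records")`. The sample rows, the step names and the helper `total_seconds_by_module` are illustrative assumptions, not petsard output.

```python
from collections import defaultdict

# Hypothetical sample rows using the documented column names; values are made up.
rows = [
    {"module_name": "Loader", "experiment_name": "load_data",
     "step_name": "run", "duration_seconds": 0.5},
    {"module_name": "Synthesizer", "experiment_name": "generate",
     "step_name": "run", "duration_seconds": 12.0},
    {"module_name": "Synthesizer", "experiment_name": "generate",
     "step_name": "run", "duration_seconds": 3.5},
]


def total_seconds_by_module(records):
    """Sum duration_seconds for each module_name."""
    totals = defaultdict(float)
    for record in records:
        totals[record["module_name"]] += record["duration_seconds"]
    return dict(totals)


print(total_seconds_by_module(rows))  # {'Loader': 0.5, 'Synthesizer': 15.5}
```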
Check Execution Status
# Check if execution completed
if exec.is_execution_completed():
    print("Execution completed")
    results = exec.get_result()
else:
    print("Execution in progress or not started")
Multi-Experiment Result Management
When the configuration contains multiple experiments, Status manages the results for every combination of experiments:
Configuration Example
Loader:
  load_v1:
    filepath: data_v1.csv
  load_v2:
    filepath: data_v2.csv
Synthesizer:
  method_a:
    method: sdv
    model: GaussianCopula
  method_b:
    method: sdv
    model: CTGAN
Reporter:
  save_all:
    method: save_data
    source: Synthesizer
Result Organization
results = exec.get_result()
# Results contain all experiment combinations:
# {
#     'Loader[load_v1]_Synthesizer[method_a]_Reporter[save_all]': {...},
#     'Loader[load_v1]_Synthesizer[method_b]_Reporter[save_all]': {...},
#     'Loader[load_v2]_Synthesizer[method_a]_Reporter[save_all]': {...},
#     'Loader[load_v2]_Synthesizer[method_b]_Reporter[save_all]': {...}
# }
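The four keys above are the Cartesian product of the experiments defined for each module (2 Loaders × 2 Synthesizers × 1 Reporter). As a rough sketch of how to predict which keys a configuration should yield, the snippet below rebuilds them with `itertools.product`. The `experiments` mapping and `expected_result_keys` are illustrative assumptions and do not reflect how Executor builds keys internally.

```python
from itertools import product

# Experiment names per module, copied from the configuration example above.
experiments = {
    "Loader": ["load_v1", "load_v2"],
    "Synthesizer": ["method_a", "method_b"],
    "Reporter": ["save_all"],
}


def expected_result_keys(config):
    """Build one 'Module[experiment]_...' key per experiment combination."""
    modules = list(config)  # dicts preserve insertion order
    keys = []
    for combo in product(*(config[module] for module in modules)):
        segments = [f"{module}[{name}]" for module, name in zip(modules, combo)]
        keys.append("_".join(segments))
    return keys


for key in expected_result_keys(experiments):
    print(key)
```

Comparing this list against `results.keys()` is one way to spot a combination that did not produce a result.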
Schema Inference and Tracking
Status supports Schema inference, which is most useful when a Preprocessor is configured:
Automatic Schema Inference
Loader:
  load_data:
    filepath: data.csv
Preprocessor:
  preprocess:
    scaling:
      age: minmax
      income: standard
    encoding:
      education: onehot
Synthesizer:
  generate:
    method: sdv
Execution flow:
- After Loader loads data, Status records original Schema
- Executor infers processed Schema based on Preprocessor configuration
- Synthesizer uses inferred Schema for synthesis
- All Schema changes throughout the process are tracked and recorded
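The idea behind the flow above is that the processed Schema can be derived from the original Schema plus the Preprocessor configuration, without running the transformation first. The sketch below illustrates that idea only: the flat `original_schema` dictionary and `infer_processed_schema` are hypothetical stand-ins, and petsard's real Schema objects and inference logic may differ.

```python
# Hypothetical flat schema: column name -> logical type.
original_schema = {
    "age": "numerical",
    "income": "numerical",
    "education": "categorical",
}

# Mirrors the Preprocessor section of the configuration above.
preprocess_config = {
    "scaling": {"age": "minmax", "income": "standard"},
    "encoding": {"education": "onehot"},
}


def infer_processed_schema(schema, config):
    """Annotate each column with the transformations the config applies to it."""
    inferred = {
        column: {"type": dtype, "transforms": []}
        for column, dtype in schema.items()
    }
    for step, columns in config.items():
        for column, method in columns.items():
            if column in inferred:  # ignore settings for unknown columns
                inferred[column]["transforms"].append(f"{step}:{method}")
    return inferred


inferred_schema = infer_processed_schema(original_schema, preprocess_config)
print(inferred_schema["age"])  # {'type': 'numerical', 'transforms': ['scaling:minmax']}
```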
Execution Snapshots
Status creates multiple snapshots during execution:
Snapshot Contents
Each snapshot contains:
- Snapshot ID: Unique identifier
- Module Information: Module name and experiment name
- Timestamp: Creation time
- Metadata: Schema state before and after execution
- Execution Context: Related configuration and parameters
Snapshot Purposes
- Debugging: Check state changes during execution
- Auditing: Track complete history of data processing
- Recovery: Restore to specific state when needed
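To make the snapshot contents concrete, here is one way such a record could be modelled. This is a sketch under assumptions: the `Snapshot` class and its field names are invented for illustration and are not the class petsard uses internally.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record holding the items listed under Snapshot Contents."""

    module_name: str
    experiment_name: str
    metadata_before: Optional[Dict[str, Any]]  # Schema state before execution
    metadata_after: Optional[Dict[str, Any]]   # Schema state after execution
    context: Dict[str, Any] = field(default_factory=dict)
    snapshot_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


snapshot = Snapshot(
    module_name="Loader",
    experiment_name="load_data",
    metadata_before=None,  # nothing exists before the first module runs
    metadata_after={"age": "numerical"},
    context={"filepath": "data.csv"},
)
print(snapshot.snapshot_id, snapshot.timestamp.isoformat())
```

Freezing the dataclass keeps a snapshot from being modified after it is created, which suits the auditing and recovery purposes listed above.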
Change Tracking
Status records all metadata changes:
Tracking Contents
- Change Type: Create, update, delete
- Change Target: Schema or Field level
- Before and After: State comparison
- Change Time: Occurrence timestamp
- Module Context: Which module caused the change
Change Example
Loader → Preprocessor:
- age: numerical → numerical (minmax scaled)
- education: categorical → categorical (onehot encoded)
- income: numerical → numerical (standard scaled)
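A change log like the one above can be derived by comparing the field-level state before and after a module runs. The sketch below shows one simple way to do that comparison. The dictionary-based schemas, the `diff_schemas` helper and the record layout are assumptions made for illustration and are not petsard's internal change-tracking format.

```python
from datetime import datetime, timezone


def diff_schemas(before, after, module):
    """Return one change record per field that was created, updated or deleted."""
    changes = []
    for name in sorted(before.keys() | after.keys()):
        if name not in before:
            change_type = "create"
        elif name not in after:
            change_type = "delete"
        elif before[name] != after[name]:
            change_type = "update"
        else:
            continue  # field unchanged, nothing to record
        changes.append({
            "change_type": change_type,
            "target": f"field:{name}",
            "before": before.get(name),
            "after": after.get(name),
            "changed_at": datetime.now(timezone.utc),
            "module": module,
        })
    return changes


# Hypothetical field states mirroring the example above.
loader_schema = {
    "age": {"type": "numerical"},
    "education": {"type": "categorical"},
}
preprocessed_schema = {
    "age": {"type": "numerical", "scaling": "minmax"},
    "education": {"type": "categorical", "encoding": "onehot"},
}

for change in diff_schemas(loader_schema, preprocessed_schema, module="Preprocessor"):
    print(change["change_type"], change["target"])
```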
Status Summary
Get a complete summary of the execution status:
# Direct access through Python API (advanced usage)
from petsard import Executor
exec = Executor(config='config.yaml')
exec.run()
# Get status summary
summary = exec.status.get_status_summary()
# Summary includes:
# - sequence: Module execution sequence
# - active_modules: Executed modules
# - metadata_modules: Modules with metadata
# - total_snapshots: Total snapshot count
# - total_changes: Total change record count
# - last_snapshot: Latest snapshot ID
# - last_change: Latest change ID
Notes
- Automatic Management: Status is fully managed automatically by Executor; no YAML configuration needed
- Result Access: Use exec.get_result() and exec.get_timing() to get status information
- Memory Usage: Long-running workflows accumulate more snapshots; Status automatically manages memory
- Snapshot Count: Each module execution generates one snapshot, so a large number of experiment combinations produces a correspondingly large number of snapshots
- Advanced Features: For the complete Status API, refer to the Python API documentation