benchmark://

Loader supports the benchmark:// protocol, which downloads and loads benchmark datasets automatically.

Usage Examples

Loading Benchmark Dataset

Loader:
  load_benchmark:
    filepath: benchmark://adult-income

Loading Benchmark Dataset with Benchmark Schema

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema

The filepath and the schema can each be either a local path or a benchmark:// path, and the two kinds can be mixed freely.
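To make the mixing of local and benchmark sources concrete, here is a minimal Python sketch. It is not part of Loader: the helper names `source_kind` and `describe_sources`, the experiment name, and the local file path are all hypothetical, chosen only for illustration. The dictionary mirrors the shape of the YAML configurations above.

```python
BENCHMARK_PREFIX = "benchmark://"


def source_kind(path: str) -> str:
    """Return 'benchmark' for benchmark:// paths, otherwise 'local'."""
    # Lowercasing first means 'Benchmark://' is treated like 'benchmark://'.
    return "benchmark" if path.lower().startswith(BENCHMARK_PREFIX) else "local"


# Hypothetical configuration: a local data file (made-up path) paired
# with the benchmark-provided schema for the Adult Income dataset.
config = {
    "Loader": {
        "load_local_with_benchmark_schema": {
            "filepath": "data/my_adult_income.csv",
            "schema": "benchmark://adult-income_schema",
        }
    }
}


def describe_sources(cfg: dict) -> dict:
    """Map each Loader experiment to the kind of each of its paths."""
    return {
        name: {key: source_kind(value) for key, value in params.items()}
        for name, params in cfg["Loader"].items()
    }


print(describe_sources(config))
```

The reverse combination, a benchmark filepath with a local schema file, would be classified the same way.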

Available Benchmark Datasets

Demographic Datasets

| Dataset Name | Protocol Path | Description |
|---|---|---|
| Adult Income | benchmark://adult-income | UCI Adult Income census dataset (48,842 rows, 15 columns) |
| Adult Income Schema | benchmark://adult-income_schema | Schema definition for Adult Income dataset |
| Adult Income (Original) | benchmark://adult-income_ori | Original training data (for demo) |
| Adult Income (Control) | benchmark://adult-income_control | Control group data (for demo) |
| Adult Income (Synthetic) | benchmark://adult-income_syn | SDV Gaussian Copula synthetic data (for demo) |

Best Practices Sample Datasets

| Dataset Name | Protocol Path | Description |
|---|---|---|
| Multi-table Companies | benchmark://best-practices_multi-table_companies | Multi-table example - Company data |
| Multi-table Applications | benchmark://best-practices_multi-table_applications | Multi-table example - Application data |
| Multi-table Tracking | benchmark://best-practices_multi-table_tracking | Multi-table example - Tracking data |
| Multi-timestamp | benchmark://best-practices_multi-table | Multi-timestamp example data |
| Categorical & High-cardinality | benchmark://best-practices_categorical_high-cardinality | Categorical and high-cardinality example data |

How It Works

  1. Protocol Detection: Loader detects the benchmark:// protocol in the filepath
  2. Automatic Download: Downloads the dataset from an AWS S3 bucket
  3. Integrity Check: Verifies data integrity using a SHA256 checksum
  4. Local Cache: Stores the data in the benchmark/ directory
  5. Data Loading: Loads the data from the cached local path
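The five steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Loader's actual implementation: the function names, the `.csv` file extension, the way the expected checksum is supplied, and the `fetch` callable (a stand-in for the S3 download) are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional

BENCHMARK_PREFIX = "benchmark://"


def parse_benchmark_name(filepath: str) -> Optional[str]:
    """Step 1: return the dataset name for benchmark:// paths, else None."""
    if filepath.lower().startswith(BENCHMARK_PREFIX):
        return filepath[len(BENCHMARK_PREFIX):].lower()
    return None


def sha256_of(path: Path) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_benchmark(
    filepath: str,
    expected_sha256: str,
    fetch: Callable[[str], bytes],  # assumed stand-in for the S3 download
    cache_dir: Path = Path("benchmark"),
) -> Path:
    """Return a local path for filepath, downloading and verifying if needed."""
    name = parse_benchmark_name(filepath)
    if name is None:
        return Path(filepath)  # ordinary local path: nothing to resolve

    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / f"{name}.csv"  # the extension is an assumption

    if not local_path.exists():
        local_path.write_bytes(fetch(name))  # Step 2: download on first use

    if sha256_of(local_path) != expected_sha256:  # Step 3: integrity check
        local_path.unlink()  # drop the bad file so it is not reused
        raise ValueError(f"SHA256 mismatch for benchmark dataset '{name}'")

    return local_path  # Steps 4-5: the cached path is loaded like a local file


if __name__ == "__main__":
    import tempfile

    payload = b"age,income\n39,<=50K\n"  # tiny fake dataset for the demo
    with tempfile.TemporaryDirectory() as tmp:
        path = resolve_benchmark(
            "benchmark://adult-income",
            hashlib.sha256(payload).hexdigest(),
            fetch=lambda name: payload,
            cache_dir=Path(tmp),
        )
        print(path.name)
```

Passing the download step in as a callable keeps the sketch runnable offline. On a second call with the same dataset, the cached file is found and `fetch` is not invoked again.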

When to Use

Benchmark datasets are suitable for:

  • Testing New Algorithms: Test on data with known characteristics
  • Parameter Tuning: Compare effects of different parameter settings
  • Performance Benchmarking: Compare with academic research results
  • Teaching Demonstrations: Provide standardized example data

Notes

  • First use requires a network connection to download the data
  • Datasets are cached locally in the benchmark/ directory
  • Large datasets may take considerable time to download
  • Protocol names are case-insensitive (lowercase recommended)
  • All datasets are verified with SHA256 to ensure integrity
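Because the first use needs a network connection, it can help to check what is already in the local cache before working offline. The following Python sketch lists the files in the benchmark/ directory. The helper name `cached_datasets` is hypothetical, and the sketch assumes only that cached datasets are stored as regular files directly inside that directory.

```python
from pathlib import Path
from typing import List


def cached_datasets(cache_dir: Path = Path("benchmark")) -> List[str]:
    """Return the sorted names of files found in the cache directory."""
    if not cache_dir.is_dir():
        return []  # nothing has been downloaded yet
    return sorted(p.name for p in cache_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    names = cached_datasets()
    if names:
        print("Cached benchmark files:", ", ".join(names))
    else:
        print("No cached benchmark files; the first load needs network access.")
```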