benchmark://

The Loader supports the benchmark:// protocol, which automatically downloads and loads benchmark datasets.

Usage Examples

Note: If using Colab, please see the runtime setup guide.

Loading Benchmark Dataset

Loader:
  load_benchmark:
    filepath: benchmark://adult-income

Loading Benchmark Dataset with Benchmark Schema

Loader:
  load_benchmark_with_schema:
    filepath: benchmark://adult-income
    schema: benchmark://adult-income_schema

The filepath and the schema can each be either a local path or a benchmark:// path, and the two kinds can be mixed freely.
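As an illustration of this mix-and-match rule, the following minimal sketch classifies each setting by its prefix and lists the four possible combinations. `source_kind` is a hypothetical helper, not part of the Loader, and the local paths are made up for the example.

```python
PROTOCOL = "benchmark://"


def source_kind(path: str) -> str:
    """Classify a path as 'benchmark' or 'local' by its prefix (hypothetical helper)."""
    # The prefix comparison ignores case; lowercase is the recommended spelling.
    return "benchmark" if path.lower().startswith(PROTOCOL) else "local"


# The four combinations of data file and schema. The local paths are invented.
combinations = [
    ("benchmark://adult-income", "benchmark://adult-income_schema"),
    ("benchmark://adult-income", "schemas/adult-income.yaml"),
    ("data/adult-income.csv", "benchmark://adult-income_schema"),
    ("data/adult-income.csv", "schemas/adult-income.yaml"),
]

for filepath, schema in combinations:
    print(f"filepath={source_kind(filepath):9}  schema={source_kind(schema)}")
```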

Available Benchmark Datasets

Demographic Datasets

| Dataset Name | Protocol Path | Description |
| --- | --- | --- |
| Adult Income | benchmark://adult-income | UCI Adult Income census dataset (48,842 rows, 15 columns) |
| Adult Income Schema | benchmark://adult-income_schema | Schema definition for Adult Income dataset |
| Adult Income (Original) | benchmark://adult-income_ori | Original training data (for demo) |
| Adult Income (Control) | benchmark://adult-income_control | Control group data (for demo) |
| Adult Income (Synthetic) | benchmark://adult-income_syn | SDV Gaussian Copula synthetic data (for demo) |
| Taiwan Salary Statistics | benchmark://taiwan-salary-statistics-300k | Taiwan salary statistics dataset (300K records) |
| Taiwan Salary Statistics (No DI) | benchmark://taiwan-salary-statistics-300k-no-di | Taiwan salary statistics dataset, No Direct Identification (300K records, with name and ID removed, birth date and address split) |

Taiwan Salary Statistics

This is a simulated dataset created by the InnoServe team in 2024 for use in challenge questions. It simulates the Ministry of Labor’s occupational salary survey statistics.

Description

  • This dataset follows the compilation methodology of the “Survey of Employed Workers’ Earnings” conducted monthly by the Directorate-General of Budget, Accounting and Statistics (DGBAS). It simulates a wide-table structure that links comprehensive income tax files with labor insurance files, labor pension monthly contribution wage files, and National Health Insurance files.
  • The simulation uses publicly available 2023 aggregate statistics and draws on multiple government open data sources for the numerical values. The data does not involve any real individuals or legal entities. Any similarity to real personal or company names is purely coincidental.
  • The dataset simulates only Taiwanese workers, but covers all 20 municipalities and counties nationwide, including Kinmen and Lienchiang.

Best Practices Sample Datasets

| Dataset Name | Protocol Path | Description |
| --- | --- | --- |
| Multi-table Companies | benchmark://best-practices_multi-table_companies | Multi-table example - Company data |
| Multi-table Applications | benchmark://best-practices_multi-table_applications | Multi-table example - Application data |
| Multi-table Tracking | benchmark://best-practices_multi-table_tracking | Multi-table example - Tracking data |
| Multi-timestamp | benchmark://best-practices_multi-table | Multi-timestamp example data |
| Categorical & High-cardinality | benchmark://best-practices_categorical_high-cardinality | Categorical and high-cardinality example data |

How It Works

  1. Protocol Detection: Loader detects benchmark:// protocol
  2. Automatic Download: Downloads dataset from AWS S3 bucket
  3. Integrity Check: Verifies data integrity using SHA256
  4. Local Cache: Data is stored in benchmark/ directory
  5. Data Loading: Loads data using local path
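The five steps above can be sketched roughly as follows. This is an illustrative outline only, not the Loader's actual implementation: `resolve`, `sha256_of`, `REGISTRY`, and the `demo-dataset` entry are hypothetical names. The S3 download is replaced by a caller-supplied `fetch` function so the sketch runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable

PROTOCOL = "benchmark://"

# Hypothetical registry: dataset name -> (cache file name, expected SHA-256 digest).
# A real registry would hold the published files and their digests.
DEMO_BYTES = b"age,income\n39,<=50K\n"
REGISTRY = {
    "demo-dataset": ("demo-dataset.csv", hashlib.sha256(DEMO_BYTES).hexdigest()),
}


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve(filepath: str, fetch: Callable[[str], bytes], cache_dir: Path) -> Path:
    """Map a filepath to a local path, fetching benchmark data when needed."""
    # Step 1: protocol detection (case-insensitive).
    if not filepath.lower().startswith(PROTOCOL):
        return Path(filepath)  # ordinary local path, nothing to download
    name = filepath[len(PROTOCOL):].lower()
    if name not in REGISTRY:
        raise KeyError(f"unknown benchmark dataset: {name}")
    file_name, expected = REGISTRY[name]
    local = cache_dir / file_name

    # Step 2: download only if the file is not in the cache yet.
    if not local.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        local.write_bytes(fetch(name))

    # Step 3: integrity check; discard the file if the digest does not match.
    if sha256_of(local) != expected:
        local.unlink()
        raise ValueError(f"SHA-256 mismatch for {name}")

    # Steps 4-5: the verified file stays in the cache and is loaded from this path.
    return local


# Usage: the second call is served from the cache and does not call fetch again.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "benchmark"
    first = resolve("benchmark://demo-dataset", lambda name: DEMO_BYTES, cache)
    second = resolve("benchmark://demo-dataset", lambda name: b"", cache)
    print(first.name, first == second)
```

Passing the download step in as a function keeps the caching and verification logic testable offline. A real loader would replace `fetch` with an HTTP or S3 client call.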

When to Use

Benchmark datasets are suitable for:

  • Testing New Algorithms: Test on data with known characteristics
  • Parameter Tuning: Compare effects of different parameter settings
  • Performance Benchmarking: Compare with academic research results
  • Teaching Demonstrations: Provide standardized example data

Notes

  • First use requires network connection to download data
  • Datasets are cached locally in benchmark/ directory
  • Large dataset downloads may take considerable time
  • Protocol names are case-insensitive (lowercase recommended)
  • All datasets are verified with SHA256 to ensure integrity