Regression Task

Evaluate synthetic data utility for regression problems.

Usage Examples

Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark://adult-income_ori
      control: benchmark://adult-income_control
    schema:
      ori: benchmark://adult-income_schema
      control: benchmark://adult-income_schema
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Evaluator:
  regression_utility:
    method: mlutility
    task_type: regression
    target: capital-gain                 # Target column (required)
    experiment_design: domain_transfer   # Experiment design (default: domain_transfer)
    metrics:                             # Evaluation metrics
      - r2_score
      - rmse
    random_state: 42                     # Random seed (default: 42)
    xgb_params:                          # XGBoost parameters (omit if not needed)
      n_estimators: 200                  # Number of trees (default: 100)
      max_depth: 8                       # Maximum tree depth (default: 6)

Task-Specific Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| target | string | Required | Target variable column name for regression |
| metrics | array | See below | Evaluation metrics to calculate |
| xgb_params | dict | None | XGBoost hyperparameters (omit for defaults) |

Default Metrics

  • r2_score, rmse

XGBoost Parameters

| Parameter | Default | Description |
|---|---|---|
| n_estimators | 100 | Number of boosting rounds (trees) |
| max_depth | 6 | Maximum tree depth |
| learning_rate | 0.3 | Learning rate (eta) |
| subsample | 1.0 | Sample ratio per tree |
| colsample_bytree | 1.0 | Feature ratio per tree |
| min_child_weight | 1 | Minimum sum of instance weight in a child |

ℹ️ To keep the default XGBoost settings, omit the entire xgb_params block; the system then uses the default values.
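
As a rough illustration of how overrides relate to defaults, the sketch below layers user-supplied xgb_params over the default values from the XGBoost Parameters table. The helper name `resolve_xgb_params` and the merge behavior are assumptions made for illustration; they are not the evaluator's confirmed implementation.

```python
# Defaults copied from the XGBoost Parameters table in this section.
DEFAULT_XGB_PARAMS = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.3,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "min_child_weight": 1,
}


def resolve_xgb_params(user_params=None):
    """Hypothetical helper: overlay user overrides on the documented defaults."""
    resolved = dict(DEFAULT_XGB_PARAMS)  # copy so the defaults are never mutated
    resolved.update(user_params or {})
    return resolved


# Omitting xgb_params keeps every default.
print(resolve_xgb_params())

# Overrides taken from the usage example; the other keys keep their defaults.
print(resolve_xgb_params({"n_estimators": 200, "max_depth": 8}))
```

Under this assumption, only the keys listed in xgb_params change; everything else stays at the value shown in the table.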

For detailed parameter descriptions and tuning guidance, please refer to the XGBoost documentation.

Supported Metrics

| Metric | Description | Range | Default | Use Case |
|---|---|---|---|---|
| r2_score | Coefficient of determination | -∞ to 1 | Yes | Primary metric for cross-dataset comparison |
| rmse | Root Mean Squared Error | 0 to ∞ | Yes | Understand prediction errors; penalizes large errors |
| mse | Mean Squared Error | 0 to ∞ | No | Convenient for optimization, but units are squared |
| mae | Mean Absolute Error | 0 to ∞ | No | More robust when data has outliers |
| mape | Mean Absolute Percentage Error | 0 to ∞ | No | For relative errors; avoid when data contains zeros |
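
To make the definitions in the table concrete, here is a minimal standard-library sketch that computes all five metrics from true and predicted values. The function name `regression_metrics` and the sample numbers are illustrative assumptions; the evaluator itself may compute these differently (for example, through a metrics library).

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute the five listed metrics from two equal-length sequences.

    Assumes y_true is not constant (otherwise R² is undefined).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the true values, so it is undefined if any of them is zero.
    if all(t != 0 for t in y_true):
        mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    else:
        mape = float("nan")
    return {
        "r2_score": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(mse),
        "mse": mse,
        "mae": mae,
        "mape": mape,  # expressed as a fraction, not a percentage
    }


# Hypothetical values, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

The zero check on MAPE mirrors the table's advice to avoid that metric when the target contains zeros.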

Key Metrics Recommendations

Primary Evaluation Criteria

| Metric | Excellent | Good | Acceptable | Needs Improvement |
|---|---|---|---|---|
| R² | ≥ 0.9 | ≥ 0.7 | ≥ 0.5 | < 0.5 |
| RMSE | < Y_StdDev×0.3 | < Y_StdDev×0.5 | < Y_StdDev×0.7 | ≥ Y_StdDev×0.7 |

*Y_StdDev: Standard deviation of the target variable (target column)
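
The thresholds above can be applied mechanically. The following sketch maps an R² value and an RMSE value to the four grades; the function names are hypothetical, and the use of the population standard deviation for Y_StdDev is an assumption, since the table does not say which variant is intended.

```python
import statistics


def grade_r2(r2):
    """Grade R² using the thresholds from the criteria table."""
    if r2 >= 0.9:
        return "Excellent"
    if r2 >= 0.7:
        return "Good"
    if r2 >= 0.5:
        return "Acceptable"
    return "Needs Improvement"


def grade_rmse(rmse, y_std):
    """Grade RMSE relative to the target's standard deviation (Y_StdDev)."""
    ratio = rmse / y_std  # normalized RMSE, unitless
    if ratio < 0.3:
        return "Excellent"
    if ratio < 0.5:
        return "Good"
    if ratio < 0.7:
        return "Acceptable"
    return "Needs Improvement"


# Hypothetical target values and scores, for illustration only.
y = [10.0, 20.0, 30.0, 40.0, 50.0]
y_std = statistics.pstdev(y)  # assumption: population standard deviation
print(grade_r2(0.82))
print(grade_rmse(5.0, y_std))
```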

Synthetic Data Utility Assessment

For the dual_model_control experiment design, compare the ori and syn results and grade the difference between them:

| Grade | R² Difference | RMSE Increase |
|---|---|---|
| Excellent | < 0.05 | < 10% |
| Good | < 0.10 | < 20% |
| Acceptable | < 0.20 | < 30% |
| Needs Improvement | ≥ 0.20 | ≥ 30% |

ℹ️ RMSE Interpretation:

  • Absolute RMSE values must be interpreted in the context of the target variable's range
  • The RMSE/Y_StdDev ratio (normalized RMSE) provides a unitless performance metric
  • Example: for house price prediction (unit: $10k), RMSE = 10 might be good; for temperature prediction (unit: °C), RMSE = 10 might be poor
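
The ori-versus-syn grading in this section can be sketched as follows. Because the RMSE increase is a relative percentage, it avoids the unit dependence described in the note. Three points are assumptions not stated in the source: the R² difference is taken as an absolute difference, the RMSE increase is measured relative to the ori RMSE, and the overall grade is the worse of the two per-metric grades. The function names are hypothetical.

```python
GRADES = ["Excellent", "Good", "Acceptable", "Needs Improvement"]


def _grade(value, limits):
    """Return the first grade whose upper limit exceeds the value."""
    for grade, limit in zip(GRADES, limits):
        if value < limit:
            return grade
    return GRADES[-1]


def grade_utility(ori, syn):
    """Compare metric dicts from the ori-trained and syn-trained models."""
    r2_diff = abs(ori["r2_score"] - syn["r2_score"])  # assumption: absolute gap
    rmse_increase_pct = (syn["rmse"] - ori["rmse"]) / ori["rmse"] * 100
    r2_grade = _grade(r2_diff, [0.05, 0.10, 0.20])
    rmse_grade = _grade(rmse_increase_pct, [10, 20, 30])
    # Assumption: the overall grade is the worse of the two.
    overall = max(r2_grade, rmse_grade, key=GRADES.index)
    return {
        "r2_diff": r2_diff,
        "rmse_increase_pct": rmse_increase_pct,
        "grade": overall,
    }


# Hypothetical results, for illustration only.
ori = {"r2_score": 0.82, "rmse": 4.0}
syn = {"r2_score": 0.76, "rmse": 4.6}
print(grade_utility(ori, syn))
```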

Usage Considerations

When to Use Regression

  • Continuous target variable: Price, temperature, score, etc.
  • Numerical predictions needed: Forecasting, estimation
  • All features are numerical: Common in regression settings, although the target variable's type, not the feature types, determines suitability

Data Preprocessing

The evaluator automatically:

  1. Removes missing values
  2. Encodes categorical variables (OneHotEncoder)
  3. Standardizes numerical features
  4. Standardizes target variable
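
The four steps above can be sketched with the standard library alone. This is a simplified illustration, not the evaluator's code: the function names are hypothetical, missing values are assumed to be represented as `None`, one-hot columns are left unscaled, and the statistics are computed on the full input rather than fitted on a training split.

```python
import statistics


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def preprocess(rows, target, categorical):
    """Return (X, y) after applying the four listed steps to a list of dicts."""
    # 1. Remove rows that contain any missing value.
    rows = [r for r in rows if all(v is not None for v in r.values())]

    columns = []  # each entry is one fully processed feature column
    for name in (c for c in rows[0] if c != target):
        values = [r[name] for r in rows]
        if name in categorical:
            # 2. One-hot encode: one 0/1 column per observed category.
            for category in sorted(set(values)):
                columns.append([1.0 if v == category else 0.0 for v in values])
        else:
            # 3. Standardize numerical features.
            columns.append(standardize(values))

    # 4. Standardize the target variable.
    y = standardize([r[target] for r in rows])
    X = [list(row) for row in zip(*columns)]  # transpose columns into rows
    return X, y


# Hypothetical rows, for illustration only.
rows = [
    {"age": 30, "job": "a", "gain": 100.0},
    {"age": 40, "job": "b", "gain": 300.0},
    {"age": None, "job": "a", "gain": 200.0},  # dropped in step 1
    {"age": 50, "job": "a", "gain": 500.0},
]
X, y = preprocess(rows, target="gain", categorical={"job"})
print(X)
print(y)
```

If the target is standardized as in step 4, error metrics such as RMSE are expressed in standardized units rather than the target's original units.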

Model Details

  • Algorithm: XGBoost Regressor
  • Objective: Minimize squared error
  • Feature importance: Available through XGBoost

ℹ️ For highly skewed target distributions, consider applying a log transformation to the target before evaluation.
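
One common way to apply the log transformation is `log1p`, which is defined at zero and therefore suits non-negative targets that contain many zeros. The sketch below is an illustration under that assumption; the helper names and sample values are hypothetical. The same transformation would need to be applied consistently to the original, control, and synthetic data before they reach the evaluator.

```python
import math


def log_transform(values):
    """Compress a right-skewed, non-negative target with log(1 + x)."""
    return [math.log1p(v) for v in values]


def inverse_log_transform(values):
    """Map log-scale values back to the original scale with exp(x) - 1."""
    return [math.expm1(v) for v in values]


# Hypothetical skewed target values, for illustration only.
target = [0.0, 0.0, 150.0, 2000.0, 99999.0]
transformed = log_transform(target)
restored = inverse_log_transform(transformed)

print(transformed)
print(restored)
```

Metrics computed on the transformed target describe errors on the log scale, so they are not directly comparable to metrics computed on the raw values.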
