Clustering Task
Evaluate synthetic data utility for unsupervised clustering problems.
Usage Examples
The following YAML configuration evaluates clustering utility on the Adult Income benchmark:
```yaml
Splitter:
  external_split:
    method: custom_data
    filepath:
      ori: benchmark://adult-income_ori
      control: benchmark://adult-income_control
    schema:
      ori: benchmark://adult-income_schema
      control: benchmark://adult-income_schema
Synthesizer:
  external_data:
    method: custom_data
    filepath: benchmark://adult-income_syn
    schema: benchmark://adult-income_schema
Evaluator:
  clustering_utility:
    method: mlutility
    task_type: clustering
    experiment_design: domain_transfer  # Experiment design (default: domain_transfer)
    n_clusters: 3                       # Number of clusters (default: 3)
    metrics:                            # Evaluation metrics
      - silhouette_score
    random_state: 42                    # Random seed (default: 42)
```
Task-Specific Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_clusters | integer | 5 | Number of clusters for K-means |
| metrics | array | [silhouette_score] | Evaluation metrics (currently only silhouette_score supported) |
Note: Clustering tasks do not require a target parameter since they are unsupervised.
Supported Metrics
| Metric | Description | Range | Default |
|---|---|---|---|
| silhouette_score | Measure of cluster cohesion and separation | -1 to 1 | ✓ |
Key Metrics Recommendations
| Metric | Description | Recommended Standard |
|---|---|---|
| Silhouette Score | Evaluates cluster compactness and separation • 1: Perfect clustering • 0: Overlapping clusters • Negative: Misaligned clusters | ≥ 0.5 |
Usage Considerations
When to Use Clustering
- No target variable available: Exploratory data analysis
- Pattern discovery needed: Customer segmentation, anomaly detection
- All numerical features: Clustering works best with numerical data
Choosing Number of Clusters
Consider these approaches (a silhouette-analysis sketch follows this list):
- Domain knowledge: Industry standards or business requirements
- Elbow method: Plot inertia vs. k and find the “elbow”
- Silhouette analysis: Test different k values and compare scores
- Gap statistic: Statistical method to estimate optimal k
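As an illustration of silhouette analysis and the elbow method, here is a minimal scikit-learn sketch that sweeps candidate k values and reports both the silhouette score and the inertia; `make_blobs` is only a stand-in for your own preprocessed feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Placeholder data; substitute your own preprocessed feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Sweep candidate k values: compare silhouette scores directly, and
# look for the "elbow" in the inertia curve.
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}, inertia={km.inertia_:.1f}")
```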
Data Preprocessing
MLUtility automatically performs the following preprocessing on input data:
Missing Value Handling
- Removes samples containing missing values (using `dropna()`; see the sketch below)
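A minimal sketch of this step, assuming pandas DataFrames (the column names are illustrative):

```python
import pandas as pd

ori = pd.DataFrame({"age": [25, None, 40], "job": ["clerk", "tech", None]})

# Rows containing any missing value are dropped before evaluation.
ori = ori.dropna()
print(ori)  # only the first row survives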
Column Type Identification
- Checks all datasets (ori, syn, control)
- If a column is categorical in any dataset, treats it as categorical
- Conservative approach ensures no categorical features are missed (see the sketch below)
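A sketch of this union rule with pandas, on toy data; note how a numeric-looking column that is polluted with strings in just one dataset is still flagged:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

ori = pd.DataFrame({"age": [25, 30], "job": ["clerk", "tech"]})
syn = pd.DataFrame({"age": [27, 33], "job": ["sales", "tech"]})
control = pd.DataFrame({"age": ["40", 22], "job": ["clerk", "exec"]})  # 'age' has a string here

# A column is treated as categorical if it is non-numeric in ANY dataset.
categorical = [
    col for col in ori.columns
    if any(not is_numeric_dtype(df[col]) for df in (ori, syn, control))
]
print(categorical)  # ['age', 'job']
```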
Categorical Feature Encoding
- Uses `OneHotEncoder` to one-hot encode categorical features
- Trains the encoder only on ori and syn data (avoids data leakage)
- `handle_unknown='ignore'`: unseen categories in control are encoded as all-zero vectors (see the sketch below)
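A sketch of the leakage-safe encoding, assuming scikit-learn (>= 1.2 for `sparse_output`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ori = pd.DataFrame({"job": ["clerk", "tech"]})
syn = pd.DataFrame({"job": ["tech", "sales"]})
control = pd.DataFrame({"job": ["exec"]})  # category never seen during fitting

# Fit on ori + syn only, so no control information leaks into the encoding.
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(pd.concat([ori, syn]))

print(enc.transform(control))  # [[0. 0. 0.]] -- unseen 'exec' maps to an all-zero vector
```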
Feature Standardization
- Uses StandardScaler on all features (numerical + encoded categorical)
- Computes mean and std only from ori and syn data (avoids data leakage)
- Control dataset reuses the same transformation parameters (see the sketch below)
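The same fit-only-on-training pattern, sketched with scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])  # combined ori + syn features
control = np.array([[10.0]])

# Mean and std are computed from ori + syn only ...
scaler = StandardScaler().fit(train)

# ... and control is transformed with those same parameters.
print(scaler.transform(control))  # [[6.7...]] -- standardized against ori+syn statistics
```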
Data Alignment
- Ensures consistent feature dimensions across all datasets
- Processed data ready for clustering analysis
ℹ️
Data Leakage Prevention: Encoders and scalers are trained only on ori and syn data, preventing control information from leaking into the training process.
Model Details
- Algorithm: K-means clustering (see the sketch below)
- Distance metric: Euclidean distance
- Initialization: k-means++ (default)
- Maximum iterations: 300
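A sketch of an equivalent model instantiation, assuming scikit-learn's `KMeans` (where k-means++ initialization and max_iter=300 are the defaults):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(42).rand(100, 4)  # placeholder preprocessed features

# init="k-means++" and max_iter=300 are scikit-learn defaults; Euclidean
# distance is inherent to the standard K-means objective.
model = KMeans(n_clusters=3, init="k-means++", max_iter=300, n_init=10, random_state=42)
labels = model.fit_predict(X)
```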
Limitations
⚠️
Current Limitations:
- Only K-means clustering is supported
- Only silhouette score metric is available
- Assumes spherical clusters (K-means assumption)
Interpreting Results
Silhouette Score Interpretation:
- 0.71-1.00: Strong structure found
- 0.51-0.70: Reasonable structure found
- 0.26-0.50: Weak structure, may be artificial
- ≤ 0.25: No substantial structure found
The score measures two aspects (formalized in the sketch below):
- Cohesion: How close points are to their own cluster
- Separation: How far points are from other clusters
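Formally, each sample's silhouette is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to points in the same cluster (cohesion) and b(i) is the mean distance to points in the nearest other cluster (separation); the reported score averages s(i) over all samples. A minimal computation sketch with scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all samples: values near 1 indicate compact,
# well-separated clusters.
print(f"silhouette = {silhouette_score(X, labels):.3f}")
```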
ℹ️
For datasets with non-spherical clusters, K-means may underperform, which can reduce the accuracy of the utility evaluation.