Tabular Playground Series - July 2022
The metric used for evaluation: Rand index
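The Rand index is computed over pairs of points. A pair is a true positive (TP) when both the reference and the predicted labeling put the two points in the same cluster, and a true negative (TN) when both put them in different clusters. FP and FN count the pairs on which the two labelings disagree. $Rand = \frac{TP + TN}{TP+TN+FP+FN}$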
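As a minimal sketch of the pair-counting definition, the function below computes the Rand index in plain Python. The name `rand_index` and the toy label lists are illustrative assumptions, not part of the competition code.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two labelings agree (TP + TN over all pairs)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same points")
    if len(labels_true) < 2:
        raise ValueError("at least two points are needed to form a pair")

    agreements = 0
    n_pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # Agreement covers both TP (same/same) and TN (different/different).
        if same_true == same_pred:
            agreements += 1
        n_pairs += 1
    return agreements / n_pairs


# Hypothetical toy labelings: cluster ids are swapped but the grouping is identical.
score_same_grouping = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Hypothetical toy labelings: the predicted grouping cuts across the true one.
score_crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(score_same_grouping, score_crossed)
```

The score depends only on how points are grouped, not on the cluster ids themselves. scikit-learn provides `sklearn.metrics.rand_score` and the chance-corrected `sklearn.metrics.adjusted_rand_score` for the same purpose.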
Identifying the optimal number of clusters
Elbow Method
This method runs a clustering algorithm for an increasing number of clusters $k$ and calculates the intra-cluster variation between the points for each value. Beyond some value of $k$, adding more clusters no longer improves the clustering significantly, as shown in the following figure:
Extracted from here
At $k=3$ the slope of the curve flattens: adding clusters beyond this point reduces the intra-cluster variation only marginally, so $k=3$ is taken as the elbow.
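The elbow method can be sketched with scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squared distances. The synthetic data from `make_blobs`, its three hand-picked centers, and the range of $k$ are assumptions made for illustration. They are not the competition data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: three well-separated blobs, so the elbow should appear at k=3.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances of samples to their closest center.
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` gives the elbow curve. On this toy data the drop is steep up to $k=3$ and small afterwards.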
sklearn.preprocessing.RobustScaler
From scikit-learn documentation:
“Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.”
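It works like the usual scaling, but each feature is centered on its median instead of its mean, and it is divided by a quantile range instead of the full range from percentile 0 to 100. By default that range is the interquartile range, from the 25th to the 75th percentile.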
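A small sketch of this behaviour compares `RobustScaler` with a manual computation of (x - median) / IQR. The one-column array with a single outlier is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Assumed toy feature with one large outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Default quantile_range is (25.0, 75.0), i.e. the interquartile range.
scaler = RobustScaler()
scaled = scaler.fit_transform(X)

# Manual equivalent: subtract the median, divide by the 25th-75th percentile range.
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q75 - q25)

print(scaler.center_, scaler.scale_)
print(scaled.ravel())
```

The outlier does not affect the median or the interquartile range here, so the four ordinary values keep a sensible spread after scaling. A narrower or wider range can be chosen with the `quantile_range` parameter.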