# Tabular Playground Series - July 2022

## The metric used for evaluation : Rand index

Using a pair (point, cluster) $Rand \sim \frac{TP + TN}{TP+TN+FP+FN}$

## Identifying the optimal number of clusters

### Elbow Method

This method uses a clustering method, then proceeds to calculate the intra-cluster variation between the points. At some point, adding more clusters does not improve clustering significantly, this is shown in the following figure:

Extracted from here

At $k=3$, we see that the slope changes because adding more clusters does not help with clustering.

`sklearn.preprocessing.RobustScaler¶`

From scikit-learn documentation:

“Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.”

It’s a normalization as usual but instead of taking for percentil 0 to 100, we take a smaller range: default 25 to 75.