K-Means divides a dataset into k groups (k is a hyper-parameter) through an iterative optimization process. Each cluster is represented by a center, and a point belongs to the cluster whose center is closest to it. For simplicity, assume the centers are initialized at random.
The goal of the model is to find clusters by moving their centers so as to reduce the total SSE over the k clusters. The SSE (Sum of Squared Errors) of a cluster is the sum of the squared distances between its center and the points assigned to it.
Calculus can be used to show that the best way to reduce a cluster's SSE is to move its center to the centroid (= average) of all the points in the cluster. With the new centers, we then reassign the points to clusters using the same closest-center rule as before, and repeat.
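The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under my own naming, not the reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch. X has shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct random points from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the centroid (mean) of its points.
        # An empty cluster keeps its old center so the mean stays defined.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, labels
```

Production implementations add smarter initialization (e.g. k-means++) and multiple restarts, since the result depends on where the centers start.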
What you need to understand before using KMeans
Most machine learning models make assumptions about the training data, and it is important to check these before drawing any conclusions from a trained model. For K-Means, they are as follows:
Non-globular structures cannot be handled by K-Means
Datasets can contain any number of patterns that are visible to the eye, and the job of a clustering algorithm is to capture this structure. Different algorithms take different approaches: K-Means and other prototype-based methods use the centroid as a reference (= prototype) for each cluster, while density-based algorithms such as DBSCAN form clusters from regions where the data points are dense.
K-Means captures the structure of globular data well, and this follows directly from how it works. Each cluster has a central point, a point belongs to the cluster whose centroid is nearest to it, and K-Means reduces the total SSE by repeatedly relocating the centroids to their optimal locations.
In a sense, K-Means works by drawing a sharp boundary between every pair of clusters in the dataset, which is why it struggles with elongated or irregular shapes.
K-Means is sensitive to outliers
Because K-Means is a distance-based method, it is susceptible to outliers. At each update step, the centroids are computed by averaging the points in a cluster, and averages are easily skewed by outliers.
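A quick NumPy illustration of how a single outlier drags a cluster's centroid (the numbers are toy values of my own):

```python
import numpy as np

# A tight cluster of four points near the origin.
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(cluster.mean(axis=0))       # centroid at [0.5, 0.5]

# Add one distant outlier and recompute the centroid.
with_outlier = np.vstack([cluster, [[100.0, 100.0]]])
print(with_outlier.mean(axis=0))  # centroid dragged all the way to [20.4, 20.4]
```

One stray point moved the centroid far outside the cluster it is supposed to represent, which is exactly what happens to K-Means centers during the update step.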
Should your data be scaled before running K-Means?
Many machine learning algorithms benefit from feature scaling techniques such as min-max scaling and standard scaling, where the 'benefit' shows up as an improvement in the evaluation metric.
How does scaling affect K-Means, and how can we tell whether it will help? Let's examine the effect of scaling on our model.
Suppose we have two features, X1 and X2, with ranges of -1 to 1 and -100 to 100, respectively. In the computation of the intra-cluster variances, X2 will contribute far more to the SSE than X1, so the model lowers the SSE mostly by reducing the contribution from X2, effectively ignoring X1. This won't happen if you apply standard scaling and transform each feature as:
- X_transformed = (X - X_mean) / X_std_dev
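In NumPy, applying this transform per feature looks like the following sketch (the feature ranges match the example above; variable names are my own):

```python
import numpy as np

# Two features on very different scales: X1 in [-1, 1], X2 in [-100, 100].
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 100), rng.uniform(-100, 100, 100)])

# Standard scaling: subtract each feature's mean, divide by its std dev.
X_transformed = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_transformed.std(axis=0))  # both features now have unit variance
```

After scaling, both features contribute on equal terms to the squared distances, so neither one dominates the SSE.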
How can the effectiveness of the model be evaluated?
The absence of labels constrains what you can say about a clustering model, but it doesn't leave you with nothing. What we care about most is how well the clusters are formed, although 'well' has no single precise definition.
Ideally, the clusters should be both large and dense. These two goals pull in opposite directions: setting k (the number of clusters) to 1 produces the largest possible cluster, while a huge k produces numerous very dense micro-clusters. In both extremes, we lose whatever useful information we hoped to infer from the clustering.
Consider analyzing a dataset with a K-Means model. How do we choose the optimal number of clusters? Recall that K-Means minimizes the SSE within clusters, so as we increase k, the total SSE decreases. When k equals the number of points in the dataset, the SSE reaches 0, since each point becomes its own cluster and centroid. That is useless, though. We only want to raise k up to the point where further increments bring diminishing reductions in SSE, the 'elbow' of the SSE-versus-k curve.
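The elbow idea can be sketched by running K-Means for several values of k and printing the total SSE. Rather than assume a particular library, this sketch uses a compact hand-rolled Lloyd's loop on made-up blob data:

```python
import numpy as np

def kmeans_sse(X, k, n_iters=50, seed=0):
    """Run a basic K-Means loop and return the final total SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    # SSE = sum of squared distances from each point to its assigned center.
    return ((X - centers[labels]) ** 2).sum()

rng = np.random.default_rng(0)
# Three well-separated blobs, so the elbow should appear near k = 3.
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])
for k in (1, 2, 3, 4, 5):
    print(k, round(float(kmeans_sse(X, k)), 2))
```

The SSE drops steeply until k reaches the true number of blobs and only creeps downward afterwards; that bend is the elbow we pick k from.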
This is where the silhouette score comes in. It is a metric that measures how well separated the clusters are: it ranges from -1 to 1, and the less the clusters overlap, the higher the score. It is calculated from two quantities for each point p in the dataset:
- a: the average distance from p to all the other points in its own cluster (the average intra-cluster distance).
- b: for each of the other clusters, the average distance from p to that cluster's points; with N clusters this gives N - 1 averages, and b is the smallest of them.
The silhouette score for p is then (b - a) / max(a, b).
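The per-point computation can be spelled out in NumPy. This sketch hard-codes three toy clusters of my own and scores a single point p from the first one:

```python
import numpy as np

# Three toy clusters; p is the first point of cluster 0.
clusters = [
    np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),  # p's own cluster
    np.array([[5.0, 5.0], [5.0, 6.0]]),
    np.array([[9.0, 0.0], [10.0, 0.0]]),
]
p = clusters[0][0]

# a: average distance from p to the other points in its own cluster.
a = np.mean([np.linalg.norm(p - q) for q in clusters[0][1:]])

# b: smallest average distance from p to the points of any other cluster.
b = min(np.mean([np.linalg.norm(p - q) for q in c]) for c in clusters[1:])

score = (b - a) / max(a, b)
print(round(score, 3))  # prints 0.866, close to 1: p is well clustered
```

Averaging this score over every point in the dataset gives the overall silhouette score, which can then be compared across different choices of k.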
K-Means is a powerful method for exploring a sizable dataset and spotting patterns in Euclidean space. Follow the guidelines above to avoid the most common pitfalls. I sincerely hope you found them useful.
Please click the follow button on the right and join my mailing list if you liked the article. It would be really meaningful to me!