In my previous post, I mentioned that there are many different algorithms that can be used to cluster a dataset.  One of the most popular is the k-means clustering algorithm.

What is k-means?

The k-means algorithm is a clustering technique that uses a distance-based approach to group data points.  The algorithm works best on numerical data, since it’s easy to define the mean, or average, of a set of numbers.  The end result is a set of clusters, each containing the data points assigned to it.

How does it work?

Let’s define a data point as \bold{x} = (x_1, x_2,\ldots, x_n) and a centroid as c_k = (c_k^{(1)}, c_k^{(2)}, \ldots, c_k^{(n)}), where k = 1,2,\ldots,m indexes the cluster C_k.

The algorithm starts by randomly picking m data points to act as the centroids, or centers, of the clusters.  From there, the distance from each data point to each centroid is calculated, and the point is labeled as part of the cluster C_k whose centroid is nearest.  Finding the nearest centroid means minimizing the following over k:

\min_{k} \parallel \bold{x} - c_k\parallel_2^2

Where \parallel \bold{x} - c_k\parallel_2^2 = \left(\sqrt{(x_1 - c_k^{(1)})^2 + (x_2 - c_k^{(2)})^2 + \ldots + (x_n - c_k^{(n)})^2}\right)^2 = (x_1 - c_k^{(1)})^2 + (x_2 - c_k^{(2)})^2 + \ldots + (x_n - c_k^{(n)})^2
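
To make the assignment step concrete, here is a minimal sketch in Python with NumPy.  The function name nearest_centroid and the example values are my own, purely for illustration:

import numpy as np

def nearest_centroid(x, centroids):
    # Squared Euclidean distance from x to every centroid at once.
    sq_dists = np.sum((centroids - x) ** 2, axis=1)
    # The index of the smallest distance is the cluster label for x.
    return int(np.argmin(sq_dists))

# Example: three 2-D centroids and a single data point.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
x = np.array([4.0, 6.0])
print(nearest_centroid(x, centroids))  # prints 1, the index of the closest centroid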

After all of the points are assigned to a cluster, we calculate the new centroid of each cluster by taking the average of all the points within the kth cluster.  You then determine which points are closest to the new centroids and reassign them accordingly.  This cycle repeats until the centroids converge, that is, until they stop moving between iterations.
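
Putting the two steps together, here is a rough sketch of the whole loop in Python with NumPy.  This is not production code, and the names (kmeans, max_iters) are my own; it simply mirrors the steps described above, assuming X is an array with one row per data point:

import numpy as np

def kmeans(X, m, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with m randomly chosen data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=m, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        # Assignment step: label each point with its nearest centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(m)
        ])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids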

Running k-means multiple times

If you only ran the algorithm once, chances are the cluster assignment would not be the best possible one.  This scenario is known as getting stuck in a local optimum.  Now, why would this be undesirable?

Suppose you wanted to determine whether a server on the network was causing a performance hit.  After some analysis, you determined that I/O usage (read, write, and seek) and performance were the best measures of a server’s efficiency.  After performing k-means clustering once, you end up with the following graph.

Red are inefficient systems and green are efficient systems. The stars are the centroids of each cluster.

You notice that too many servers are returning false negatives (that is, servers that are failing, but are not being reported as such).  From this analysis, you won’t capture all of the systems that are yielding poor performance.  As a result, the issue would continue.  From a business perspective, people would stop using the service because the company’s servers are slow, thus causing a loss of profit.

While companies might not use k-means clustering specifically to detect inefficient servers, there are many companies, such as Facebook, Netflix, and Amazon, that face this kind of issue on a daily basis.  After all, hundreds of thousands, if not millions, of users access their services simultaneously, and tens, if not hundreds, of millions use the services daily.

So, how do you mitigate this issue?  Well, you simply rerun the k-means algorithm multiple times, initializing random centroids on each run.  Yet, without some way of measuring the quality of the clusters, it would be difficult to determine the best run.  To resolve this, we define a distortion function (that is, a measure of how tightly the points in each cluster sit around their centroid) as follows:

J = \displaystyle\sum_{k = 1}^{m}\displaystyle\sum_{\bold{x} \in C_k}\parallel \bold{x} - c_k\parallel_2^2
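
In code, the distortion is just the sum of squared distances from every point to the centroid of its own cluster.  A small sketch, reusing the labels and centroids returned by the kmeans sketch above:

import numpy as np

def distortion(X, labels, centroids):
    # Each point minus the centroid of the cluster it was assigned to.
    diffs = X - centroids[labels]
    # Sum of squared L2 distances over every point, matching J above.
    return float(np.sum(diffs ** 2))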

After each run, the distortion is calculated and compared against the best run so far.  If the distortion is lower, that run becomes the new best.  Once you’ve run k-means multiple times, the labeling from the run with the lowest distortion is used.
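
A rough sketch of that restart loop, reusing the kmeans and distortion functions sketched earlier (the number of runs here is arbitrary):

def kmeans_best_of(X, m, n_runs=10):
    best_J, best_labels, best_centroids = None, None, None
    for run in range(n_runs):
        # Fresh random centroids on every run via a different seed.
        labels, centroids = kmeans(X, m, seed=run)
        J = distortion(X, labels, centroids)
        # Keep the run with the lowest distortion seen so far.
        if best_J is None or J < best_J:
            best_J, best_labels, best_centroids = J, labels, centroids
    return best_labels, best_centroids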

Drawbacks

Even though the idea of k-means clustering is simple to understand, there are a few issues that need to be addressed.

For one thing, clustering, in general, is an NP-hard problem.  That is, there is no known fast way to determine the optimal clustering solution.  In the case of k-means, the most popular approach is Lloyd’s algorithm.  The running time of this algorithm is O(k*i*n*d), where

  • k is the number of clusters.
  • n is the number of data points.
  • d is the number of dimensions (or features).
  • i is the number of iterations in order to reach convergence.

With these factors multiplied together, k-means becomes inefficient when:

  • There are too many clusters.
  • There are too many data points.
  • There are too many dimensions in the data.  This is a common issue in cluster analysis.
  • The amount of iterations needed for convergence is high.

This issue is worsened when running k-means multiple times to determine the least distorted run.

Another drawback of k-means clustering is its limited use with non-numerical data.  As previously mentioned, the algorithm uses the mean of all the points within a cluster to calculate the position of the centroid.  This makes sense when the data consists of numbers.  With categorical data, however, there is no obvious way to define a mean.

Finally, since real-world applications might work with millions, if not billions, of pieces of data, you might only be able to pass through the data once to cluster it.  K-means would not scale well, due to needing all of the data in memory at once and the number of iterations needed for convergence.  As a result, k-means would not fare well in a streaming context.

Conclusion

K-means clustering is ideal for working with small to medium-sized data sets.  Unlike some other clustering techniques, k-means uses the distance between centroids and data points to form clusters.  However, it doesn’t work well with non-numerical data and is slow when you need to search for the best solution across multiple runs.  Despite these drawbacks, k-means is a very popular clustering technique that enjoys numerous applications.

Have any questions, comments, or spotted an error?  Leave a comment down below and I’ll get back to you ASAP.