If you shop online, chances are you’re getting bombarded with suggestions from the sites you visit.  These suggestions aren’t random; they’re based on what you recently browsed and purchased.  So how do these sites decide what to recommend and what to ignore?

The system described above is called a recommendation system.  One common way to implement it is with a method called clustering, which is part of a broader field called Cluster Analysis.

What is Cluster Analysis?

Cluster Analysis is the task of grouping similar objects together to form clusters.  These clusters can provide insight into the data and can help solve business problems.

Unlike many machine learning algorithms, clustering can be used on data that is not explicitly labeled.  This makes clustering an unsupervised learning technique.

Like Support Vector Machines (SVM), Cluster Analysis is very broad in terms of theory, implementation, and applications.  As a result, there’s no way I can cover every aspect of Cluster Analysis in a single post.

Why is Cluster Analysis important?

As humans, we’re good at detecting patterns in the information given to us.  However, this only works when there isn’t too much information or when the patterns are fairly obvious.  Once information is represented with too many dimensions, or features, it becomes much harder for us to detect patterns manually.  This is especially problematic for problems that involve large amounts of information, diverse data formats, and many end-users.  So how do we tackle these issues?

For one thing, we have computers.  Obviously, they can perform calculations many times faster than humans can.  Yet computers alone are useless without instructions.

With Cluster Analysis and the power of computers, we can quickly group data together to gain insight into the dataset.  This also gives us the ability to work with data that contains many features and is represented in different formats.

Approaches to clustering

There are several methods that can be used to cluster data:

  • Dimensionality reduction methods remove unnecessary features from the dataset.  This reduces the complexity of the model, but some information is lost.
  • Probabilistic models can also be used to cluster data.  The Expectation Maximization algorithm is a common method for fitting such models.
  • Distance-based methods calculate the distance between a data point and each cluster to decide which cluster the point belongs to.
  • Density-based and grid-based techniques look for dense regions of the data space, which allows data to be explored in great detail.
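
To make the distance-based approach a little more concrete, here is a minimal sketch of k-means (Lloyd's algorithm), which is one well-known distance-based method.  The post doesn't single out any particular algorithm, so treat this as an illustration under my own assumptions: the function name `kmeans`, the toy `points`, and the hand-picked starting centroids are all made up for the example.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Group points around centroids using Lloyd's algorithm (k-means)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

# Starting centroids are picked by hand here; real code usually picks them randomly.
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Notice that the algorithm never sees a label.  It only uses distances, and the result depends on where the centroids start, so it is common to run it several times with different starting points.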

Commonly workable data types

The following data types are common when working with clustering:

  • Categorical Data – While a common data type in machine learning, categories have no natural numeric distance, so a notion of similarity between values needs to be defined first.
  • Text Data – Data that often appears in documents, comments, and posts.  Since each individual word appears in only a small percentage of documents, the data is sparse and can be very complex to work with.
  • Time Series Data – Data where each value depends on values recorded at earlier points in time.
  • Discrete Data – Sequences of discrete values where each value depends on the values placed before it.
  • Multimedia Data – Increasingly being used for clustering.  Due to the variety of possible data formats, multimedia data is harder to work with.
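
For the categorical case, one possible way to define how alike two records are is to count the attributes on which they disagree.  The sketch below uses that simple matching idea.  The function name and the sample records are hypothetical, and other measures may suit a given dataset better.

```python
def matching_dissimilarity(a, b):
    """Fraction of attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Hypothetical records: (color, size, material) of items three shoppers bought.
item_a = ("red", "small", "cotton")
item_b = ("red", "large", "cotton")
item_c = ("blue", "large", "wool")

print(matching_dissimilarity(item_a, item_b))  # differ on one attribute of three
print(matching_dissimilarity(item_a, item_c))  # differ on every attribute
```

A value of 0 means the records are identical and 1 means they share nothing, so this number can stand in for the distance that numeric data gets for free.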

Applications of Cluster Analysis

While this list is not exhaustive, clustering can be used in a variety of situations:

  • Some methods are used as an intermediate step for another algorithm.
  • Collaborative filtering can be used to cluster people with similar views and ratings.  This is one approach to building recommendation systems.
  • Customer segmentation can be used to cluster people with similar interests.
  • Clustering can be used to represent data for developing insight into a dataset.
  • Clustering also allows users to reduce the complexity of data by reducing the number of features.
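
As a rough illustration of the collaborative filtering idea above, the sketch below compares users by the ratings they gave and finds the most similar one.  Cosine similarity is my choice for the example rather than something the list prescribes, and the user names, items, and ratings are invented.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two rating dicts; unrated items count as 0."""
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[item] * ratings_b[item] for item in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, others):
    """Name of the user in `others` whose ratings best match `target`."""
    return max(others, key=lambda name: cosine_similarity(target, others[name]))


# Hypothetical ratings on a 1-5 scale.
alice = {"book": 5, "lamp": 1, "mug": 4}
others = {
    "bob": {"book": 4, "lamp": 2, "mug": 5},
    "carol": {"book": 1, "lamp": 5},
}

print(most_similar(alice, others))  # bob
```

A recommender could then suggest items that the most similar users liked but the target user hasn't seen yet.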

Issues with clustering

Despite the variety of approaches and applications, there are a few drawbacks when doing Cluster Analysis:

  • Clustering data is generally an NP-Hard problem.  That is, there is no known efficient algorithm that is guaranteed to find the best possible clustering in a reasonable amount of time.  However, this can be alleviated by the use of heuristics, or rules of thumb, which settle for a good clustering rather than an optimal one.
  • The more features the dataset contains, the harder it is to cluster data in a meaningful manner.  While dimensionality reduction can reduce the complexity, it’s also possible to lose valuable information in the process.
  • There is no one method that can handle all types of data.  Certain types of data, such as categorical, are easier to work with than others, like multimedia.  This is because each data type has its own unique properties and challenges.
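
One way to see why many features make clustering harder is to measure how the nearest and farthest neighbors of a point compare as the number of dimensions grows.  The sketch below is a small experiment on uniformly random data, which is an assumption of mine and not a claim about any real dataset.  The function name and parameters are made up for the example.

```python
import math
import random


def distance_contrast(dimensions, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


# A ratio close to 1 means "near" and "far" points are hard to tell apart.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```

On data like this the ratio tends to creep toward 1 as dimensions are added, which means distance-based methods have less and less to work with.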

Conclusion

Cluster Analysis is a very broad topic, and research on it is ongoing.  Clustering allows users to gain insight into their data by assigning data points to a few clusters.  As a result, clustering is a popular approach for recommendation systems.  However, it doesn’t scale very well when the number of features taken into account is very high.  Despite the drawbacks, clustering enjoys many applications.

Have any questions or comments, or spotted an error?  Leave a comment down below and I’ll get back to you ASAP.