So far, I’ve talked about regression or classification algorithms that can be used to solve problems.  Sometimes though, we just want to discover some associations within our data.  These associations can, in turn, be used by a business to optimize profits.

One of the fundamental algorithms for this kind of problem is the Apriori algorithm.

What is the Apriori algorithm?

Apriori is an association-rule algorithm that finds associations within a dataset.  These associations allow the user to gain insight into their data.  This works especially well when there is no obvious relationship within the data.

How does Apriori work?

Apriori works best with transaction data, e.g. purchases made at a supermarket.  To make the algorithm effective, we define two thresholds:

  1. Support: how frequently the pattern occurs.
  2. Confidence: how often the pattern holds true.

Without the thresholds, you’ll end up with too many associations, most of them unimportant.

The algorithm is split into two steps.

The first step is determining how often a pattern appears in the dataset.  The first threshold, support, sets the minimum fraction of transactions in which a pattern must appear.  Using the supermarket scenario, suppose we have several transactions and the support needed for a set of items to be considered frequent is 40 %:

{bread, butter}
{beef, fries, ketchup}
{bread, beef, lettuce, onion}
{beer, fries, onion, beef}
{bread, milk, beer}
{butter, bread, milk, beer}
{beef, bread, fries, beer}

We start by evaluating each item individually to determine how frequently it occurs.  This evaluation gives the following values:

  • bread: 5/7
  • beer: 4/7
  • beef: 4/7
  • fries: 3/7
  • butter: 2/7
  • milk: 2/7
  • onion: 2/7
  • ketchup: 1/7
  • lettuce: 1/7
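
The counting pass above can be sketched in a few lines of Python.  This is a minimal illustration, not a reference implementation: the function name `item_support` and the use of `Fraction` to keep exact values are my own choices.

```python
from collections import Counter
from fractions import Fraction

# The seven supermarket transactions from the example.
transactions = [
    {"bread", "butter"},
    {"beef", "fries", "ketchup"},
    {"bread", "beef", "lettuce", "onion"},
    {"beer", "fries", "onion", "beef"},
    {"bread", "milk", "beer"},
    {"butter", "bread", "milk", "beer"},
    {"beef", "bread", "fries", "beer"},
]

def item_support(transactions):
    """Return {item: fraction of transactions that contain the item}."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: Fraction(c, n) for item, c in counts.items()}

support = item_support(transactions)
min_support = Fraction(2, 5)  # the 40 % threshold

# Keep only the items whose support meets the threshold.
frequent_items = sorted(i for i, s in support.items() if s >= min_support)
print(frequent_items)
```

With a 40 % threshold and 7 transactions, an item has to appear in at least 3 transactions to survive this pass.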

So, we know that bread, beer, beef, and fries have some significance, since each appears in at least 3 of the 7 transactions (about 43 %).  We then continue on and evaluate the sets of two items drawn from bread, beer, beef, and fries.  So,

  • {bread, beer}: 3/7
  • {bread, beef}: 2/7
  • {bread, fries}: 1/7
  • {beer, beef}: 2/7
  • {beer, fries}: 2/7
  • {beef, fries}: 3/7

We now know that {bread, beer} and {beef, fries} have some significance.  We would continue on with larger sets until no more sets pass the threshold.  Here the search stops, since no three-item set can be built whose two-item subsets are all frequent.
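
The level-by-level search can be sketched as follows.  This is a simplified version under my own assumptions: the names `support` and `apriori` are hypothetical, candidates are built by joining every pair of surviving sets, and support is recomputed by scanning all transactions each time, which real implementations try to avoid.

```python
from itertools import combinations

transactions = [
    {"bread", "butter"},
    {"beef", "fries", "ketchup"},
    {"bread", "beef", "lettuce", "onion"},
    {"beer", "fries", "onion", "beef"},
    {"bread", "milk", "beer"},
    {"butter", "bread", "milk", "beer"},
    {"beef", "bread", "fries", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Level-wise search: frequent k-item sets seed the (k+1)-item candidates."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates that meet the support threshold.
        scored = {c: support(c, transactions) for c in level}
        survivors = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(survivors)
        # Join survivors into (k+1)-item candidates, then prune any
        # candidate that has an infrequent k-item subset.
        k += 1
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        level = [
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        ]
    return frequent

frequent = apriori(transactions, min_support=0.4)
for itemset, s in frequent.items():
    print(sorted(itemset), round(s, 3))
```

The pruning step is what gives Apriori its name: a set can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.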

Now that we have a list of frequent sets, the second step is to form the associations.  As a general rule, the second threshold, confidence, is set higher than the support.  Confidence is defined as:

conf(X,Y) = \frac{support(X,Y)}{support(X)}

The frequent sets {bread, beer} and {beef, fries} give the following possible rules:

  • {bread} -> {beer} : 3/5
  • {beer} -> {bread} : 3/4
  • {beef} -> {fries} : 3/4
  • {fries} -> {beef} : 3/3

Once a confidence threshold is applied, the association rules for the dataset are established.
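
As a rough sketch of this second step, the snippet below computes the confidence of every rule that can be formed from the two frequent pairs and keeps those above a threshold.  The 70 % confidence threshold and the helper names `count` and `rules_from` are assumptions made for illustration; the example above does not fix a confidence value.

```python
from fractions import Fraction
from itertools import combinations

transactions = [
    {"bread", "butter"},
    {"beef", "fries", "ketchup"},
    {"bread", "beef", "lettuce", "onion"},
    {"beer", "fries", "onion", "beef"},
    {"bread", "milk", "beer"},
    {"butter", "bread", "milk", "beer"},
    {"beef", "bread", "fries", "beer"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def rules_from(itemset, min_confidence):
    """Yield (antecedent, consequent, confidence) for one frequent set."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            # support(X, Y) / support(X); the shared denominator cancels out.
            conf = Fraction(count(itemset), count(lhs))
            if conf >= min_confidence:
                yield lhs, itemset - lhs, conf

frequent_pairs = [{"bread", "beer"}, {"beef", "fries"}]
min_confidence = Fraction(7, 10)  # assumed 70 % threshold, for illustration only

rules = [rule for pair in frequent_pairs for rule in rules_from(pair, min_confidence)]
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), conf)
```

Under this assumed threshold, {bread} -> {beer} is dropped (3/5 is below 70 %) while {beer} -> {bread} is kept (3/4), which shows that a rule and its reverse can have different confidence.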

Pros and Cons

Whereas many machine learning algorithms (logistic regression, neural networks, etc.) determine the most likely value or class with little explanation, Apriori makes suggestions on possible associations to look into.  These associations can help optimize models and provide insight into why these behaviors exist.  Additionally, the concept of Apriori is easy to understand since we only want to form rules based on frequent occurrences.

However, Apriori doesn’t work well when too many subsets are formed while generating the most frequent patterns.  Since Apriori is a breadth-first search algorithm, too many unique items will slow it down, and the search space can become unreasonable to compute.
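
To get a feel for how quickly the search space grows, the worst case can be counted directly: n unique items can form 2^n - 1 non-empty sets.  The helper name `candidate_itemsets` is my own, and this is an upper bound only, since the support threshold usually prunes most of these sets.

```python
def candidate_itemsets(n_items):
    """Number of non-empty sets that can be formed from n_items distinct items."""
    return 2 ** n_items - 1

# 9 is the number of unique items in the supermarket example.
for n in (9, 20, 50):
    print(n, candidate_itemsets(n))
```

Even the nine items of the supermarket example allow 511 possible sets, and the count doubles with every item added.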

Conclusions

Unlike other machine learning algorithms, Apriori is primarily geared towards generating the most frequent associations.  Apriori also works best on data with transactions.  While Apriori is slow on very large datasets, it can efficiently narrow possible rules and provide insight on behaviors.

Have any questions, comments, or spotted an error?  Leave a comment down below and I’ll get back to you ASAP.