This week, I worked with the Belgium retail market dataset.  In my previous post, I talked about how Apriori can be used to generate association rules, so I searched for a good dataset to apply the algorithm to.  This one consists of over 88,000 transactions covering over 16,000 different items.  While the items are only represented as numeric IDs, we can still apply the algorithm.  This analysis demonstrates how the support and confidence thresholds influence the number of rules generated.
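
For context, here is a minimal sketch of the kind of pipeline involved, assuming the transactions are stored one per line as whitespace-separated item IDs and using mlxtend's Apriori implementation.  The filename and thresholds below are illustrative placeholders, not the exact values from my notebook:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each line of the file is assumed to be one transaction: whitespace-separated item IDs.
# "retail.dat" is a placeholder filename -- point it at wherever the file lives locally.
with open("retail.dat") as f:
    transactions = [line.split() for line in f]

# One-hot encode the transactions; sparse output keeps the ~88k x ~16k matrix manageable.
te = TransactionEncoder()
encoded = te.fit_transform(transactions, sparse=True)
basket = pd.DataFrame.sparse.from_spmatrix(encoded, columns=te.columns_)

# Frequent itemsets at a 3% support floor, then rules filtered by a confidence threshold.
frequent_itemsets = apriori(basket, min_support=0.03, use_colnames=True, low_memory=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(len(frequent_itemsets), "frequent itemsets,", len(rules), "rules")
```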

While working on the algorithm, I didn’t initially realize that many of the items appear in only about 1% of all transactions.  Had I set the support threshold too low, the algorithm would have taken a very long time to execute because of the number of candidate itemsets it would have to evaluate.  As a result, the lowest support threshold I used was 3%.
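
To get a feel for how skewed the item frequencies are before committing to a threshold, a quick pass over the raw file is enough.  This is a rough sketch under the same assumption as above (one whitespace-separated transaction per line, placeholder filename):

```python
from collections import Counter

# Count how many transactions each item appears in (deduplicated per line,
# in case an item ID can repeat within a single transaction).
with open("retail.dat") as f:  # placeholder path, same file as above
    transactions = [set(line.split()) for line in f]

item_counts = Counter(item for t in transactions for item in t)
n = len(transactions)

# How many distinct items clear a given support threshold.
for threshold in (0.01, 0.03, 0.05):
    kept = sum(1 for c in item_counts.values() if c / n >= threshold)
    print(f"support >= {threshold:.0%}: {kept} of {len(item_counts)} items")
```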

The issue I ran into demonstrates that applying Apriori to a dataset with a very large number of items dramatically increases the search space.  In addition, it becomes very hard to extract any useful information when too many rules are generated.  When working on these kinds of problems, consider the following:

  • What would be an optimal support and confidence threshold?  (One way to explore this is sketched after the list.)
  • Can I use a cluster to break down the amount of work?
  • Which generated rules are worth researching further?
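
On the first question, one simple approach is to sweep a small grid of thresholds and count how many rules each combination produces.  This sketch reuses the `basket` DataFrame from the earlier snippet; the threshold values are illustrative:

```python
from mlxtend.frequent_patterns import apriori, association_rules

# Sweep a small grid of support/confidence thresholds and record the rule counts.
for min_support in (0.03, 0.05, 0.10):
    itemsets = apriori(basket, min_support=min_support, use_colnames=True, low_memory=True)
    for min_conf in (0.5, 0.7, 0.9):
        rules = association_rules(itemsets, metric="confidence", min_threshold=min_conf)
        print(f"support={min_support:.2f}, confidence={min_conf:.1f}: {len(rules)} rules")
```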

The PDF of the analysis can be found here, and the notebook can be found on GitHub.

For those wanting to experiment with the dataset, the link can be found here.  How would you approach this problem?

Have any questions or comments?  Feel free to leave a comment down below and I’ll get back to you ASAP.