Algorithm: Bernoulli Naive Bayes

In my post on Naive Bayes, I mentioned that there are multiple variants that can be applied to different problems.  In this post, I will introduce another variant of Naive Bayes, one that uses the Bernoulli distribution.

Read more


Introduction to Support Vector Machines

So far, I have mainly discussed classification algorithms that use probabilities to make decisions.  However, there are algorithms that don't require computing probabilities.  One such algorithm is the support vector machine.

Read more


Dataset: Belgium retail market

For this week's dataset, I worked with the Belgium retail market dataset.  In my previous post, I talked about how Apriori can be used to generate association rules, so I searched for a good dataset to apply the algorithm to.  The dataset consists of over 88,000 transactions with over 16,000 different items.  While the dataset only contains numbers, we can still apply the algorithm.  This analysis demonstrates how the support and confidence thresholds influence the number of rules generated.
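To make the effect of the two thresholds concrete, here is a minimal sketch in plain Python.  The toy transactions, the function names, and the threshold values are all made up for illustration and are not taken from the retail dataset.  The sketch also checks every single-item rule by brute force rather than running Apriori itself.

```python
from itertools import permutations

# Hypothetical toy transactions; items are plain integer IDs.
transactions = [
    {1, 2, 3},
    {1, 2},
    {1, 3},
    {2, 3},
    {1, 2, 3, 4},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def count_rules(transactions, min_support, min_confidence):
    """Count single-item rules A -> B that pass both thresholds (brute force)."""
    items = sorted(set().union(*transactions))
    passed = 0
    for a, b in permutations(items, 2):
        if (support({a, b}, transactions) >= min_support
                and confidence({a}, {b}, transactions) >= min_confidence):
            passed += 1
    return passed

# Loosening or tightening either threshold changes how many rules survive.
for min_sup, min_conf in [(0.1, 0.7), (0.5, 0.7), (0.5, 0.9)]:
    print(min_sup, min_conf, count_rules(transactions, min_sup, min_conf))
```

Lowering the support threshold lets rules about rare items through, while raising the confidence threshold removes rules whose antecedent often appears without the consequent.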

Read more


Algorithm: Apriori

So far, I've talked about regression and classification algorithms.  Sometimes though, we just want to discover some associations within our data.  These associations can, in turn, be used by a business to optimize profits.

One of the fundamental algorithms for this kind of problem is the Apriori algorithm.
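As a rough illustration of the idea, here is a minimal sketch of the frequent-itemset part of Apriori in plain Python.  The function name, the toy shopping baskets, and the threshold are assumptions made for this example, and the code favors readability over speed.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = set().union(*transactions)
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical baskets, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori_frequent_itemsets(baskets, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so larger candidates are discarded before their support is ever counted.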

Read more


Dataset: Anime Recommendations Database

This week's task is to determine the most recommended anime from a list of anime shows and user ratings.  To produce a list of recommended shows, I built a very primitive recommendation system based on two criteria:

Read more


Algorithm: ID3

In my decision tree post, I mentioned several different algorithms that can be used to create a decision tree.  Today, I'll be talking about one of them: the Iterative Dichotomiser 3 (ID3) algorithm.

Read more


Dataset: Mushroom Data Set

This week's task is classifying the edibility of mushrooms given several attributes.  I was originally going to compare Naive Bayes and decision trees on the dataset, but scikit-learn doesn't accept string-valued features when training models.  Additionally, I'm not yet equipped to write a decision tree algorithm from scratch.  Despite these setbacks, running Naive Bayes against this dataset yields very good results, with 99% accuracy.
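One common way around string-valued attributes is to map each distinct value to an integer code before training.  The sketch below shows that idea in plain Python.  The helper names and the sample rows are hypothetical, and this is not necessarily the preprocessing used in the full post.

```python
def fit_category_codes(rows):
    """Build one {value: code} mapping per column, with values sorted for determinism."""
    columns = list(zip(*rows))
    return [
        {value: code for code, value in enumerate(sorted(set(col)))}
        for col in columns
    ]

def encode_rows(rows, codebooks):
    """Replace every string value with its integer code, column by column."""
    return [[book[value] for book, value in zip(codebooks, row)] for row in rows]

# Hypothetical rows, made up for illustration: (cap shape, odor)
rows = [
    ("convex", "almond"),
    ("bell", "foul"),
    ("convex", "foul"),
]
codebooks = fit_category_codes(rows)
encoded = encode_rows(rows, codebooks)
print(codebooks)
print(encoded)
```

Keep in mind that integer codes suggest an ordering that the original categories don't have, so this suits models that treat each code as a separate category better than models that treat it as a magnitude.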

Read more


Algorithm: Gaussian Naive Bayes

Recall from my Naive Bayes post that there are several variants.  The variant that I'll be talking about today is Gaussian Naive Bayes.

Read more


Probability Distributions and Random Variables

Suppose I had two coins and I flipped both of them.  The possible combinations are two heads, two tails, or one of each.  Together, these combinations make up a sample space.

Now let's take this a step further.  Using the example above, we want to determine the probability of each combination occurring.

Assuming the coins are fair and the flips are independent, we derive the following probabilities:

  • P(\text{two heads}) = 0.25
  • P(\text{two tails}) = 0.25
  • P(\text{one of each}) = 0.5

Together, these probabilities form a probability distribution.
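The probabilities above can be checked with a short sketch that lists every way two fair coins can land and groups the results.  The function name and the grouping labels are my own choices for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def two_coin_distribution():
    """Return the probability of 'two heads', 'two tails', and 'one of each'."""
    # Every equally likely result of flipping two fair coins: HH, HT, TH, TT.
    flips = list(product("HT", repeat=2))

    def label(flip):
        heads = flip.count("H")
        return {2: "two heads", 0: "two tails"}.get(heads, "one of each")

    counts = Counter(label(flip) for flip in flips)
    return {event: Fraction(n, len(flips)) for event, n in counts.items()}

distribution = two_coin_distribution()
for event, p in distribution.items():
    print(f"P({event}) = {float(p)}")

# The probabilities in a distribution always add up to 1.
print(sum(distribution.values()))
```

"One of each" is twice as likely as the other two because it covers two of the four equally likely results: heads then tails, and tails then heads.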

Read more


No dataset this week

I just want to inform you guys that since it's my birthday, I would like to take a quick rest from analyzing a dataset.  However, I will prepare another one for you guys next week.

If you guys have any questions for me, feel free to leave a comment down below or contact me.

Happy coding!