So far, the algorithms I’ve talked about model the data in a linear manner.  While these algorithms can be effective for simple problems, they don’t work well when there is a non-linear relationship between the features and the output.  Such problems include voice, text, and image recognition, anomaly detection, game-playing bots, and any other problem where the output has no straightforward relationship with the features.

Some non-linear algorithm classes that can solve these kinds of problems include neural networks, decision trees, and clustering.  These classes often have variants that suit different purposes.  In this post, I’ll be talking about a different classification algorithm called Naive Bayes.

What is Naive Bayes?

Naive Bayes is a set of algorithms that use Bayes’ theorem to determine the likelihood that an event will occur.  All of these algorithms assume that the features are independent of each other.  There are several variants of Naive Bayes worth noting, and each will be covered in its own post:

  • Regular – This type of Naive Bayes simply uses Bayesian probabilities to determine the likelihood of an event.
  • Gaussian – A Naive Bayes implementation that utilizes a Normal (Gaussian) distribution to determine the likelihood of an event.
  • Bernoulli – A Naive Bayes implementation that utilizes a Bernoulli distribution.
  • Multinomial – A Naive Bayes implementation that utilizes a Multinomial distribution.

This post will talk about the first type.

The idea for Naive Bayes

As previously mentioned, Naive Bayes uses Bayes’ theorem to determine, given some features, the probability that an event would occur.  For those rusty on Bayes’ theorem, here is the formula:

P(A|B) = \frac{P(AB)}{P(B)}, where P(AB) = P(B|A)P(A)
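As a quick worked example (the numbers here are made up purely for illustration), suppose 30% of emails are spam, 40% of spam emails contain the word “free”, and 20% of all emails contain the word “free”.  Then the probability that an email containing “free” is spam is:

P(\text{spam}|\text{free}) = \frac{P(\text{free}|\text{spam})P(\text{spam})}{P(\text{free})} = \frac{0.4 \times 0.3}{0.2} = 0.6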

From this idea, the formula for Naive Bayes is defined as follows:

P(A|B_1, B_2, \ldots, B_n) = \frac{1}{P(B_1, B_2, \ldots, B_n)}(\displaystyle\prod_{i=1}^{n}P(B_i|A))P(A)

As a note, the symbol \displaystyle\prod_{i=1}^{n} is similar to the symbol \displaystyle\sum_{i=1}^{n}, but rather than adding the terms for each i from 1 to n, we’re multiplying them.
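If it helps to see the difference in code, here’s a quick Python illustration (the values are stand-ins, and math.prod requires Python 3.8 or later):

import math

probabilities = [0.5, 0.25, 0.8]  # stand-in values for P(B_i|A)

print(sum(probabilities))        # sum:     0.5 + 0.25 + 0.8 = 1.55
print(math.prod(probabilities))  # product: 0.5 * 0.25 * 0.8 = 0.1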

The formula accounts for each feature, B_i, given A.  Since deriving this formula is beyond the scope of this post, I’ll show the math in another post.

Using Naive Bayes

Now that we have the formula:

P(A|B_1, B_2, \ldots, B_n) = \frac{1}{P(B_1, B_2, \ldots, B_n)}(\displaystyle\prod_{i=1}^{n}P(B_i|A))P(A)

How exactly do we work with this algorithm?  Since this is the regular version, we will have two events: A and its complement, A^c.  The probability of A is defined as:

P(A) = \frac{\text{\# of test cases in A}}{\text{total \# of test cases}}

Defining the probability of A^c is easy as well:

P(A^c) = 1 - P(A)

Now, separating all of the test cases into two categories, we can calculate the probability of some feature B_i given an event as:

P(B_i|A) = \frac{\text{\# of test cases in A containing } B_i}{\text{total \# of test cases in A}}

The same can be done with A^c.
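To make that concrete, here is a minimal from-scratch sketch in Python.  The spam-filter framing, the feature names, and the tiny training set are all made up for illustration, and a feature that never appears with a class will zero out that class’s product, which a real implementation would guard against.

# A minimal sketch of "regular" Naive Bayes with two classes: A (spam) and its
# complement A^c (not spam).  Each test case is a set of binary features.

def train(cases):
    # cases is a list of (features, label) pairs, where features is a set of
    # feature names present in that case and label is True for A.
    a_cases = [features for features, label in cases if label]
    not_a_cases = [features for features, label in cases if not label]

    p_a = len(a_cases) / len(cases)  # P(A) = # of cases in A / total # of cases
    p_not_a = 1 - p_a                # P(A^c) = 1 - P(A)

    def feature_probs(group):
        # P(B_i | event) = # of cases in the group containing B_i / # of cases in the group
        counts = {}
        for features in group:
            for b in features:
                counts[b] = counts.get(b, 0) + 1
        return {b: count / len(group) for b, count in counts.items()}

    return p_a, p_not_a, feature_probs(a_cases), feature_probs(not_a_cases)

def score(features, prior, likelihoods):
    # Multiply P(B_i | event) for every feature, then multiply by the prior.
    result = prior
    for b in features:
        result *= likelihoods.get(b, 0)  # unseen feature -> probability 0 in this sketch
    return result

def classify(features, model):
    p_a, p_not_a, probs_a, probs_not_a = model
    return score(features, p_a, probs_a) >= score(features, p_not_a, probs_not_a)

# Hypothetical training data: each case is (set of words present, is_spam).
training_cases = [
    ({"free", "winner"}, True),
    ({"free", "meeting"}, True),
    ({"meeting", "report"}, False),
    ({"report", "schedule"}, False),
]

model = train(training_cases)
print(classify({"free"}, model))    # True  -> classified as A (spam)
print(classify({"report"}, model))  # False -> classified as A^c (not spam)

One design note: because the denominator P(B_1, B_2, \ldots, B_n) is the same for A and A^c, the sketch skips it entirely and just compares the two products.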

Pros and Cons

One benefit of using Naive Bayes is its ability to train with very little data.  The algorithm can also classify test cases very quickly since you’re only multiplying probabilities together.  Unlike several of the more advanced classification algorithms, Naive Bayes works when you have limited resources (CPU and memory) and is easy to code up from scratch.  The functionality of Naive Bayes is straightforward to explain using Bayes’ theorem and basic probability properties.  Finally, since the algorithm assumes the features are independent of each other, the probability calculations are simplified.

However, Naive Bayes is not as accurate as the more complex classification algorithms.  Additionally, since we’re talking about the regular version, it does not handle continuous values very well.  If you need to handle those kinds of values, Gaussian Naive Bayes would be a better choice.  Finally, its assumption of independent features is also its Achilles’ heel: it doesn’t take into account features that depend on other features.

Conclusion

The Naive Bayes classification algorithm is based on Bayes’ theorem.  It assumes that all features are independent of each other, which allows for simplified calculations.  The algorithm is also quick to calculate the likelihood of a category and can be trained with very little data.  While Naive Bayes isn’t as accurate as other classification algorithms, it can be a good baseline when solving problems.

Have any questions or comments, or spotted an error?  Leave a comment down below.