In my previous post, I introduced a class of algorithms for solving classification problems.  I also mentioned that Naive Bayes is based on Bayes’ theorem.  In this post, I will derive Naive Bayes using Bayes’ theorem.

Deriving Naive Bayes

According to Bayes’ theorem, we can find the probability of some event given some feature as follows:

P(A|B) = \frac{P(AB)}{P(B)}, where P(AB) = P(B|A)P(A)
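Before moving on, here’s a quick sanity check of the theorem in Python. The numbers are made up purely for illustration, using a spam-filter flavored example:

```python
# Made-up numbers: suppose 1% of emails are spam (A), 80% of spam emails
# contain the word "free" (B), and 10% of all emails contain "free".
p_a = 0.01         # P(A): prior probability of spam
p_b_given_a = 0.8  # P(B|A): probability of "free" given spam
p_b = 0.10         # P(B): overall probability of "free"

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.08
```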

Now, what if you’re dealing with multiple features?  Let’s start with P(A|BC).  Here, we have an event that depends on the features B and C.  How would we break this down?

We know that:

P(A|BC) = \frac{P(ABC)}{P(BC)}

Breaking down P(ABC), we get:

P(ABC) = P(B|AC)P(AC)

We can also break down P(AC) as:

P(AC) = P(C|A)P(A)

So, for P(ABC), we now have:

P(ABC) = P(B|AC)P(C|A)P(A)

In fact, what we just demonstrated is the chain (or multiplicative) rule of probability.  This property can be generalized as follows:

P(\displaystyle\bigcap_{k=1}^{n}{A_k}) = \displaystyle\prod_{k=1}^{n}P(A_k|\displaystyle\bigcap_{j=1}^{k-1}{A_j})

Note that \displaystyle\bigcap_{k=1}^{n} denotes the intersection of the variables from 1 to n.  Now, we could instead have broken P(ABC) down as:

P(ABC) = P(C|AB)P(B|A)P(A)

The two forms are equivalent; which one is more useful depends on how the features depend on each other.
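To see the equivalence concretely, here’s a quick numerical check in Python. The joint distribution below is made up purely for illustration; both decompositions recover the same joint probability:

```python
import itertools
import random

random.seed(0)

# A made-up joint distribution P(A, B, C) over three binary variables,
# normalized so the probabilities sum to 1.
events = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in events]
total = sum(weights)
joint = {e: w / total for e, w in zip(events, weights)}

def marginal(fixed):
    """Sum the joint over outcomes consistent with the fixed values,
    e.g. marginal({'A': 1, 'C': 0}) = P(A=1, C=0)."""
    idx = {'A': 0, 'B': 1, 'C': 2}
    return sum(p for e, p in joint.items()
               if all(e[idx[k]] == v for k, v in fixed.items()))

a, b, c = 1, 1, 0  # any assignment works

# P(ABC) = P(B|AC) P(C|A) P(A)
form1 = (marginal({'A': a, 'B': b, 'C': c}) / marginal({'A': a, 'C': c})
         * marginal({'A': a, 'C': c}) / marginal({'A': a})
         * marginal({'A': a}))

# P(ABC) = P(C|AB) P(B|A) P(A)
form2 = (marginal({'A': a, 'B': b, 'C': c}) / marginal({'A': a, 'B': b})
         * marginal({'A': a, 'B': b}) / marginal({'A': a})
         * marginal({'A': a}))

print(abs(form1 - joint[(a, b, c)]) < 1e-12)  # True
print(abs(form2 - joint[(a, b, c)]) < 1e-12)  # True
```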

Plugging the decomposition of P(ABC) back into the formula for P(A|BC), we get:

P(A|BC) = \frac{P(B|AC)P(C|A)P(A)}{P(BC)}

However, there is a problem with this formula: the terms still condition on combinations of features.  Estimating a probability like P(B|AC) requires data for every combination of feature values, which quickly becomes intractable as the number of features grows.  So now what?

Recall that Naive Bayes assumes the features are conditionally independent given the event.  When B and C are independent given A, we can drop the other features from the conditioning side of each probability, since they no longer provide any additional information:

P(C|AB)P(B|A)P(A) = P(C|A)P(B|A)P(A)

We don’t, however, drop A, since B and C each depend on A.

So, we can now rewrite P(A|BC) as:

P(A|BC) = \frac{P(B|A)P(C|A)P(A)}{P(BC)}
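Here’s a minimal sketch of this two-feature formula in Python. The probabilities are made up; in practice they would be estimated from training data. Under the same independence assumption, the denominator P(BC) can be computed by summing the numerator over the class values of A (the law of total probability):

```python
# Made-up conditional probabilities for a two-class, two-feature problem.
p_a = {True: 0.3, False: 0.7}          # P(A) for each class value
p_b_given_a = {True: 0.8, False: 0.1}  # P(B|A)
p_c_given_a = {True: 0.6, False: 0.4}  # P(C|A)

# Numerator P(B|A) P(C|A) P(A) for each class value of A.
numer = {a: p_b_given_a[a] * p_c_given_a[a] * p_a[a] for a in p_a}

# P(BC) is the numerator summed over the class values, which
# normalizes the posterior so it sums to 1.
p_bc = sum(numer.values())
posterior = {a: numer[a] / p_bc for a in numer}
print(posterior)  # {True: ~0.837, False: ~0.163}
```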

Now, generalizing from the equation above, we get the following:

P(A|\displaystyle\bigcap_{i=1}^{n}B_i) = \frac{1}{P(\displaystyle\bigcap_{i=1}^{n}B_i)}P(A)\displaystyle\prod_{k=1}^{n}P(B_k|A)
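Finally, here’s a minimal sketch of how the generalized formula becomes a classifier, assuming binary features and probabilities already estimated from data. Since the denominator P(\bigcap B_i) is the same for every class, a classifier can ignore it and simply pick the class with the largest numerator; log probabilities avoid numerical underflow when there are many features:

```python
import math

def naive_bayes_predict(priors, likelihoods, features):
    """Pick the class A maximizing P(A) * prod_k P(B_k|A).

    priors:      dict mapping class -> P(A)
    likelihoods: dict mapping class -> list of P(B_k|A), one per feature
    features:    list of booleans saying whether each feature B_k occurred

    P(B_1...B_n) is omitted: it is identical for every class, so it
    does not affect which class scores highest.
    """
    scores = {}
    for cls in priors:
        score = math.log(priors[cls])
        for p, present in zip(likelihoods[cls], features):
            score += math.log(p if present else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Made-up example: two classes, three binary features.
priors = {'spam': 0.3, 'ham': 0.7}
likelihoods = {'spam': [0.8, 0.6, 0.1], 'ham': [0.1, 0.4, 0.5]}
print(naive_bayes_predict(priors, likelihoods, [True, True, False]))  # 'spam'
```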

Have any questions or comments, or spotted an error?  Leave a comment down below.