In my previous algorithm post, I talked about a family of algorithms called Naive Bayes.  These algorithms use Bayes’ theorem, probabilities, and an independence assumption to determine the most likely category for a test case.  However, they don’t take into account the relationships between features.  Additionally, it would be nice to visualize how the model actually makes decisions.  Fortunately, decision trees let us visualize how each feature contributes to a classification.

What are Decision Trees?

A decision tree uses a tree structure to map out how data is categorized.  The idea is similar to the following flow chart:

In the flow chart above, we start at the top node and ask whether the person was male.  If not, then we know the person survived and we stop asking further questions.  If so, we ask additional questions until we reach a leaf node.
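To make the traversal concrete, here’s a minimal sketch of the flow chart as nested conditionals in Python.  The feature names and thresholds (sex, age, siblings) are hypothetical, chosen to mirror the chart above rather than any real dataset:

```python
# A sketch of the flow chart above written as nested conditionals.
# The features and thresholds are hypothetical examples, not taken
# from a real dataset.

def predict_survival(passenger: dict) -> str:
    """Walk the tree from the root node down to a leaf."""
    if passenger["sex"] != "male":
        return "survived"   # leaf: non-male passengers
    if passenger["age"] > 9.5:
        return "died"       # leaf: older male passengers
    if passenger["siblings"] > 2:
        return "died"       # leaf: young males with many siblings
    return "survived"       # leaf: remaining young males

print(predict_survival({"sex": "female", "age": 30, "siblings": 0}))  # survived
print(predict_survival({"sex": "male", "age": 40, "siblings": 0}))    # died
```

Each `if` corresponds to one internal node of the tree, and each `return` is a leaf; a real decision tree algorithm learns these questions and thresholds from the data instead of hard-coding them.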

Decision trees are typically supervised learning algorithms and are most often used in classification problems.  However, they can also be used in regression problems.

Just like Naive Bayes, there are several variants of a decision tree:

• ID3 – A primitive decision tree algorithm invented by Ross Quinlan.  It doesn’t handle continuous data very well.
• C4.5 – An improvement over ID3.  It can handle continuous data.
• C5.0 – An improvement over C4.5.  It’s much faster and uses less memory than C4.5.
• CART – This type can create either a classification tree or a regression tree.
• CHAID – A technique based on Bonferroni testing.
• MARS – A tree based on regression analysis.

I will write a separate post on how ID3 works.  While the other algorithms are more involved, I’ll do my best to cover them in separate posts as well.  Regardless, the list provides several options you can choose from to construct a decision tree.

How does a Decision Tree work?

So now that I’ve covered several variants of a decision tree, how does an algorithm determine the best way to build the tree?

Oftentimes, the tree is built in a top-down approach.  That is, given a category we want to classify, the algorithm starts at the root and builds the tree downward.  To determine the best way to classify the categories, these algorithms use a metric to choose the next feature to split on.  The actual metric depends on the algorithm, but most of the time, entropy and information gain are used to determine how the tree will be mapped.

I’ll discuss these metrics when I cover some of the decision tree algorithms in detail.
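As a preview, here’s a small sketch of the two metrics using their standard definitions: entropy measures how mixed a set of labels is, and information gain measures how much a candidate split reduces that entropy.  The example labels are made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(labels, groups):
    """Reduction in entropy after splitting `labels` into `groups`."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# A perfectly mixed set of labels has 1 bit of entropy...
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                                           # 1.0
# ...and a split that cleanly separates the classes recovers all of it.
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A tree-building algorithm like ID3 computes the information gain of every candidate feature and splits on the one with the highest value, then repeats on each resulting group.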

Pros and Cons

Due to their nature, decision trees have several advantages:

• Decision trees can be easily understood since they can be visualized.  This property makes them a white box model.
• Decision trees mirror the thought process a human being might use to make decisions.
• Depending on how deep and balanced the tree is, searches can be done quickly.
• There is very little data preparation needed to generate a tree.
• Decision trees can handle large amounts of data.

Despite these properties, there are several disadvantages to using a decision tree:

• Determining the optimal way to map the tree cannot be done efficiently (the problem is NP-complete).
• Additionally, a tree can fit its training data too closely and then handle new patterns poorly.  This is known as over-fitting your data.
• Depending on the algorithm, the tree might be generated in a very complex manner.  This can lead to longer search times.

Conclusion

When it comes to representing your data, decision trees can create a model that makes decisions based on the provided dataset.  In fact, due to their white box approach, decision trees can easily be interpreted by even non-statisticians.  However, decision trees don’t guarantee the optimal model for the data.  Additionally, they cannot handle new patterns without rebuilding the tree.  Despite these weaknesses, decision trees offer a simple yet powerful way to model a dataset.