In my decision tree post, I mentioned several different types of algorithms that can be used to create a decision tree.  Today, I’ll be talking about a decision tree called the Iterative Dichotomiser 3 (ID3) algorithm.

What is ID3?

ID3 allows the user to generate a decision tree from a dataset.  It was invented by Ross Quinlan in 1986 and was the first in a series of decision tree algorithms that he introduced.  He later improved upon ID3 with the C4.5 decision tree algorithm.

How does ID3 generate a tree?

ID3 generates a decision tree in a top-down approach.  That is, we start at the node of a tree and we build the tree downward.  The algorithm is finished building when we generated all of the terminating nodes.  So, how does ID3 determine a good way to build the tree?

Deriving from Information Theory, ID3 uses two functions to determine the best way to build a tree:

  1. Entropy
  2. Information gain

Suppose we have a dataset S.  When we start the algorithm, we must first calculate the entropy of the dataset.

The entropy function, H(S), is used to measure the unpredictability of a given sample X.  X holds the possible classes that we can classify, denoted by x.  The formula is defined as follows:

H(S) = -\displaystyle\sum_{x \in X}p(x)*log_2 (p(x))

Then, we begin to separate each feature to determine which one can influence the algorithm the most.  This is where information gain comes in.

Let k be the possible values for a given feature, K.  Information gain, IG(K, S) is determined by the difference between H(S) and the sum of each entropy of feature k, defined as H(k).

IG(K, S) = H(S) - \displaystyle\sum_{k \in K}p(k)H(k)

The information gain for each feature is calculated.  The highest value determines which feature will be mapped next in the tree.  The algorithm then works down each child node, creates a subset from the dataset containing only that value, and applies the same logic until a single class is reached.  When there are no more children, the algorithm returns the tree.


Suppose we want to determine whether someone is overweight (do note that this is only for demonstrating ID3 only):


We first determine the entropy of our dataset:

[latex size = “2”] H(overweight) = -(2/3)\log_2 (2/3) -(1/3)\log_2 (1/3) = 0.918 [/latex]

We now need to determine which feature has the greatest information gain:

IG(gender, overweight) = H(overweight) - \displaystyle\sum_{k \in gender}p(k)H(k) = 1.836

IG(weight, overweight) = H(overweight) - \displaystyle\sum_{k \in weight}p(k)H(k) = 0.918

IG(height, overweight) = H(overweight) - \displaystyle\sum_{k \in height}p(k)H(k) = 1.584

So, from this information, gender has the highest information gain.  We would then go down one of the children (we’ll do female) and construct a subtree using instances {2,4,5}.  We would calculate the new entropy and information gain from this subset.

H(overweight) = -(2/3)\log_2 (2/3) -(1/3)\log_2 (1/3) = 0.918

IG(weight, overweight) = H(overweight) - \displaystyle\sum_{k \in weight}p(k)H(k) = 0.918

IG(height, overweight) = H(overweight) - \displaystyle\sum_{k \in height}p(k)H(k) = 1.584

Since height gives the most information, we now have height as the next node.

Following this logic, we now know that any tall girl is not overweight.  As for girls who are short, we must further determine whether they are overweight.

Here’s the illustration that ID3 created in the first several steps:

Partial decision tree constructed from ID3.

Pros and Cons

ID3 has several benefits:

  • Compared to other decision tree algorithms, ID3 is a simple algorithm to understand.
  • It can accurately classify data.
  • For those wanting to implement the algorithm by scratch, it’s a good way to practice tree traversal and recursion.

However, ID3 has several drawbacks:

  • ID3 suffers from high variance.
  • ID3 doesn’t always generate the most optimal tree.  This can cause longer search times.
  • ID3 can’t handle continuous variables very well.


ID3 was one of the first decision tree algorithms that were used to classify data.  With a reasonably sized dataset, the algorithm can produce good decision trees.  However, it is not the most optimal algorithm for producing decision trees.  Additionally, the algorithm works best with categorical data, not quantitative data.  Despite these limitations, ID3 does provide a good stepping stone to more complex decision tree algorithms.

For those curious, here is Quinlan’s paper on the ID3 decision tree.

Have any questions, comments, or spotted an error.  Leave a comment down below and I’ll get back to you ASAP.