This week’s dataset classifies the edibility of mushrooms from several attributes.  I originally planned to compare Naive Bayes and decision trees on this dataset, but scikit-learn doesn’t accept string-valued features when training models; the categorical attributes must be encoded as numbers first.  Additionally, I’m not yet equipped to write a decision tree algorithm from scratch.  Despite these setbacks, running Naive Bayes against this dataset yields very good results, with 99% accuracy.
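
For the curious, here is a minimal sketch of the encoding workaround.  The file name mushrooms.csv is an assumption about the Kaggle copy of the data, and CategoricalNB is one reasonable Naive Bayes variant for all-categorical features; my notebook may differ in the details.

```python
# Minimal sketch: encode the string-valued columns, then train Naive Bayes.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.read_csv("mushrooms.csv")  # hypothetical local copy of the dataset

# scikit-learn estimators expect numeric feature matrices, so every
# categorical column is mapped to integer codes first.
X = OrdinalEncoder().fit_transform(df.drop(columns="class")).astype(int)
y = LabelEncoder().fit_transform(df["class"])  # 'e'/'p' -> 0/1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# CategoricalNB treats each integer code as a category, a natural fit
# for a dataset whose features are all categorical.
model = CategoricalNB().fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```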

Now, how well would a model trained on this data fare against real mushrooms?

As I noted in my analysis of Kaggle’s Human Resources dataset, the data usually doesn’t give us the critical pieces needed to answer our questions.  The mushroom dataset has a few issues:

  • The dataset only lists traits and whether the mushroom is edible.  It does not identify the actual species being tested.
  • There are mushrooms that share many of the same traits.  There might be a distinguishing trait that is not captured in the dataset.
  • Some mushrooms are only edible once cooked.  The dataset does not account for this.
  • While not a critical issue, some records are missing the stalk root type.  This could be because the value is unknown or because the mushroom doesn’t have a stalk root (see the sketch after this list).
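
Here is a minimal sketch of two ways the missing values could be handled.  The file name mushrooms.csv, the stalk-root column name, and the ‘?’ placeholder are assumptions about the Kaggle copy of the data, not necessarily what my notebook uses.

```python
# Two hedged options for the missing stalk-root values ('?' in the raw data).
import pandas as pd

df = pd.read_csv("mushrooms.csv")  # hypothetical local copy of the dataset

# Option 1: leave '?' as its own category.  "No observable stalk root"
# may itself be an informative trait.
df_keep = df.copy()

# Option 2: impute the most common stalk-root type, on the theory that
# the value exists but simply wasn't recorded.
mode = df.loc[df["stalk-root"] != "?", "stalk-root"].mode()[0]
df_imputed = df.copy()
df_imputed["stalk-root"] = df["stalk-root"].replace("?", mode)
```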

These flaws lead to another lesson.  A person with domain expertise in a particular field could better prepare, model, and interpret the data.  In our case, a mushroom forager (someone who hunts for mushrooms) could add features, such as taxa, or ask very different questions.

For example, the dataset asks about the edibility of a mushroom given some traits.  A mushroom forager, however, could ask, “What species of mushroom does the data describe?”  Answering that question would let them determine edibility indirectly, since a known species has a known edibility.  While the example is laughable, there is a point to be made here.  An expert who works on the dataset can ask better questions and apply better models to derive a more effective solution.

The PDF of my analysis can be found here.  The notebook can be found on GitHub.

The original dataset can be found on the UCI Machine Learning Repository.

For those who analyzed the dataset, what other models did you apply?  What other features would you add to increase accuracy?  Share your findings in the comment section.

If you have questions or comments about the analysis, feel free to leave a comment and I’ll get back to you ASAP.