This week’s post covers one of the best-known datasets in machine learning. Introduced by Ronald Fisher in 1936, the iris dataset is commonly used to test the accuracy of machine learning algorithms. The dataset has three plant types that need to be classified:
Iris setosa
Iris versicolor
Iris virginica
The dataset is simple since it only has four features that can be used to classify the plant types. The features are:
sepal width and length
petal width and length
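For readers who want to poke at the data directly, here is a minimal sketch of loading the dataset with scikit-learn, which bundles a copy of it (the original analysis may have loaded it from the UCI repository instead):

```python
from sklearn.datasets import load_iris

# Load the bundled copy of Fisher's iris dataset
iris = load_iris()

print(iris.data.shape)     # 150 samples, 4 features
print(iris.feature_names)  # sepal and petal width/length, in cm
print(iris.target_names)   # the three species to classify
```

Each row of `iris.data` is one flower, and `iris.target` holds the matching species label (0, 1, or 2).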
Now, why didn’t I start out with this dataset in my first post? For one thing, this dataset has been worked on so many times that you can find virtually any kind of analysis already done on it. Second, there isn’t much I can do with the dataset beyond measuring the efficiency of machine learning algorithms; I can’t really ask any interesting questions about the plant types. Finally, data in the real world is neither structured nor clean. There’s more to data analysis than just running algorithms: you not only have to prepare the data, you also have to interpret what it means.
But wait, why work on this dataset at all then? Well, now that I’ve covered a few algorithms, we can compare the efficiency of several machine learning algorithms on it. When working on a dataset, it’s common practice to try several models and compare them before choosing the best one.
The analysis compares four different algorithms, among them:
Support Vector Machine
Decision Tree
The decision tree was the most effective algorithm, with 100% accuracy. However, as I’ll discuss in a future post on decision trees, it’s also very poor when it comes to flexibility.
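As a rough sketch of how such a comparison might look in scikit-learn (the exact models, parameters, and validation scheme in my analysis may differ), cross-validation gives a fairer accuracy estimate than a single train/test split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Two of the candidate models, with default settings (an assumption)
models = {
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validation: train on 4/5 of the data, score on the rest,
# rotating the held-out fold, then average the accuracies
for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")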
Here is the PDF of my analysis. The notebook can also be found on GitHub.
For those interested in analyzing the dataset themselves, here is a direct link to the UCI Machine Learning Repository. Found anything interesting in your analysis? Share your findings in the comments section.
If you have questions or comments about the analysis, feel free to leave a comment and I’ll get back to you ASAP.