With Data Science being such a popular field to break into, it's no surprise that the number of contributions to Kaggle has increased dramatically. I recently stumbled across a dataset that gathers the most popular kernels and decided to do some exploratory data analysis on it.
by Joseph Woolf
One of the hottest disciplines in the tech industry in 2017 was Deep Learning. Because of it, many startups have placed an emphasis on AI, and many frameworks have been developed to make implementing these algorithms easier. Google's DeepMind was even able to create AlphaGo Zero, which mastered the game of Go without relying on human gameplay data. This analysis, however, is much more basic than anything developed recently. In fact, the dataset is the popular MNIST database, which consists of handwritten digits and is commonly used to test computer vision algorithms.
It's been a while since I last did an analysis on a dataset. Today's dataset will focus on a corpus that deals with children learning English as a second language. The study was done by Johanne Paradis from the University of Alberta.
In this week's post, I worked with the Belgian retail market dataset. In my previous post, I talked about how Apriori can be used to generate association rules, so I searched for a good dataset on which to apply the algorithm. The dataset consists of over 88,000 transactions covering over 16,000 different items. While the dataset encodes items only as numbers, we can still apply the algorithm. This analysis demonstrates how support and confidence influence the number of rules generated.
This week's goal is to determine the most recommended anime from a list of shows and user ratings. To produce the list of recommended shows, I built a very primitive recommendation system based on two criteria:
This week's dataset involves classifying the edibility of mushrooms given several attributes. I was originally going to compare Naive Bayes and decision trees on the dataset, but scikit-learn doesn't accept string features when training models, and I'm not yet equipped to write a decision tree algorithm from scratch. Despite these setbacks, running Naive Bayes against this dataset yields very good results, with 99% accuracy.
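The string-feature limitation can be worked around by label-encoding each categorical column into integers before training. A minimal sketch of that preprocessing step is below; the attribute names and values are invented in the spirit of the mushroom data, not taken from it.

```python
# Hypothetical rows resembling the mushroom data: every feature is a
# categorical string, which scikit-learn estimators won't accept directly.
rows = [
    ("convex", "smooth", "edible"),
    ("bell",   "scaly",  "poisonous"),
    ("convex", "scaly",  "poisonous"),
    ("flat",   "smooth", "edible"),
]

def encode_column(values):
    """Map each distinct string to a small integer (simple label encoding)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

columns = list(zip(*rows))        # transpose rows into per-feature columns
encoded, mappings = [], []
for col in columns:
    codes, mapping = encode_column(col)
    encoded.append(codes)
    mappings.append(mapping)

X = list(zip(*encoded[:-1]))      # integer feature matrix
y = encoded[-1]                   # integer class labels
```

Once the features are integers, they can be fed to a categorical Naive Bayes model (or one-hot encoded for estimators that treat inputs as continuous).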
This week's dataset is one of the best-known datasets in machine learning. Introduced in 1936 by Ronald Fisher, the iris dataset is commonly used to test the accuracy of machine learning algorithms. Read more
This week's dataset is on Kaggle's Human Resources Analysis. The question that the dataset asks is:
Why are our best and most experienced employees leaving prematurely?
I then asked the following question:
How well can we predict whether an employee is going to leave?
It's definitely possible to answer the second question with great accuracy. I used a decision tree because the features form a non-linear relationship. I have a picture of the tree, but it's far too big to upload to this post. Read more
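The post doesn't reproduce its code, but the reason a decision tree handles non-linear features can be sketched through its split criterion: at each node the tree picks the threshold that minimizes weighted Gini impurity, regardless of whether the class boundary is linear. The feature values and labels below are hypothetical, loosely inspired by an HR-style "satisfaction vs. leaving" relationship.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after splitting the data at `threshold`."""
    left  = [lab for x, lab in zip(feature, labels) if x <= threshold]
    right = [lab for x, lab in zip(feature, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

satisfaction = [0.1, 0.2, 0.8, 0.9, 0.15, 0.85]  # hypothetical feature
left_company = [1,   1,   0,   0,   1,    0]     # 1 = employee left

# Scan candidate thresholds and keep the one with the lowest impurity.
best = min((t / 10 for t in range(1, 10)),
           key=lambda t: split_impurity(satisfaction, left_company, t))
```

Here the scan finds a threshold that separates the two classes perfectly (impurity 0), which is how a tree carves out non-linear decision regions one axis-aligned split at a time.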
This week's dataset covers housing data from Boston, Massachusetts. The dataset is provided by UCI and is primarily geared towards regression. The main point of this analysis is to determine how the cross-validation error and testing error behave as the number of cases increases. Read more
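The idea behind that analysis, a learning curve, can be sketched without the Boston data at all: fit a model on growing training subsets and record training and held-out error at each size. The snippet below uses synthetic noisy linear data and plain least squares, purely as an assumed stand-in for the real dataset and regression model.

```python
import random

random.seed(0)

def make_point():
    """Synthetic noisy linear data (stand-in for the real housing cases)."""
    x = random.uniform(0, 10)
    return x, 3.0 * x + 2.0 + random.gauss(0, 1.0)

def fit_line(points):
    """Ordinary least squares for y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    return a, my - a * mx

def mse(points, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

train = [make_point() for _ in range(200)]
test  = [make_point() for _ in range(100)]

# Learning curve: (n cases, training error, held-out error) per subset size.
curve = []
for n in (5, 20, 80, 200):
    a, b = fit_line(train[:n])
    curve.append((n, mse(train[:n], a, b), mse(test, a, b)))
```

As cases are added, the held-out error typically falls toward the noise floor while the training error rises to meet it, which is the convergence behavior the analysis examines.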
This week's dataset explores crimes that occurred in Los Angeles between 2012 and 2016. I had two objectives in mind when working with it. The first was observing crime patterns to see whether anything interesting stood out. The second was getting more experience manipulating data with pandas. Read more