Popular Kaggle Kernels dataset

With Data Science being such a popular field that people want to get into, it’s no surprise that the number of contributions to Kaggle has dramatically increased.  I recently stumbled across a dataset that gathers the most popular kernels and decided to do some exploratory data analysis on it.

Continue reading Popular Kaggle Kernels dataset

Kaggle’s Digit Recognizer dataset

One of the hottest disciplines in the tech industry in 2017 was Deep Learning.  Thanks to Deep Learning, many startups placed an emphasis on AI, and many frameworks were developed to make implementing these algorithms easier.  Google’s DeepMind was even able to create AlphaGo Zero, which mastered the game of Go without relying on human gameplay data.  My analysis here, however, is much more basic than anything developed recently.  In fact, the dataset is the popular MNIST database, which consists of handwritten digits used to test computer vision.
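For context, each MNIST example is a 28×28 grayscale image, and a common first preprocessing step is to flatten it into a 784-length feature vector with pixel values scaled to [0, 1].  A minimal sketch of that step (using a tiny made-up 3×3 "image" rather than real MNIST data) might look like:

```python
# Sketch of standard MNIST preprocessing: flatten a 2-D grayscale grid into
# one feature vector and scale 0-255 pixel values into [0, 1].
# The 3x3 "image" below is made up; real MNIST images are 28x28.
image = [
    [0, 128, 255],
    [0,   0, 255],
    [0, 255,   0],
]

features = [pixel / 255.0 for row in image for pixel in row]
print(len(features))  # 9 (would be 784 for a real 28x28 image)
print(features[2])    # 1.0
```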

Continue reading Kaggle’s Digit Recognizer dataset

Dataset: Belgium retail market

In this week’s dataset, I worked with the Belgium retail market dataset.  In my previous post, I talked about how Apriori can be used to generate association rules, so I searched for a good dataset I could apply the Apriori algorithm to.  The dataset consists of over 88,000 transactions spanning over 16,000 different items.  While the items are identified only by numbers, we can still apply the algorithm.  This analysis demonstrates how support and confidence influence the number of rules generated.
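The two metrics mentioned above can be computed with just a few lines of Python.  Here’s a minimal sketch (the toy transactions are made up, using numeric item IDs like the Belgium dataset does):

```python
from itertools import combinations

# Toy transactions with numeric item IDs; these example baskets are made up.
transactions = [
    {1, 2, 3},
    {1, 2},
    {2, 3},
    {1, 2, 4},
    {3, 4},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the full rule divided by support of the antecedent."""
    full = set(antecedent) | set(consequent)
    return support(full, transactions) / support(antecedent, transactions)

# The rule {1} -> {2}: how often do baskets with item 1 also contain item 2?
print(support({1, 2}, transactions))       # 0.6
print(confidence({1}, {2}, transactions))  # 1.0
```

Raising the minimum support or confidence thresholds prunes more candidate itemsets, which is exactly why those two knobs control how many rules Apriori ends up generating.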

Continue Reading

Dataset: Mushroom Data Set

This week’s dataset involves classifying the edibility of mushrooms from several attributes.  I was originally going to compare Naive Bayes and decision trees on this dataset, but scikit-learn’s models don’t accept string-valued features directly, and I’m not yet equipped to write a decision tree algorithm from scratch.  Despite these setbacks, running Naive Bayes against this dataset yields very good results, with 99% accuracy.
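The usual workaround for the string-feature limitation is to map each category to an integer before training.  A minimal sketch of that encoding step (the column names and values below are made up, mushroom-style examples, not the actual dataset):

```python
# Minimal label-encoding sketch: map string categories to integers so a
# scikit-learn-style model can consume them. Rows and values are made up.
rows = [
    {"cap_shape": "bell",   "odor": "almond", "class": "edible"},
    {"cap_shape": "convex", "odor": "foul",   "class": "poisonous"},
    {"cap_shape": "bell",   "odor": "foul",   "class": "poisonous"},
]

def encode_column(values):
    """Assign each distinct string an integer code in first-seen order."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes

odors, odor_codes = encode_column(r["odor"] for r in rows)
print(odors)       # [0, 1, 1]
print(odor_codes)  # {'almond': 0, 'foul': 1}
```

Once every column is encoded this way, the resulting integer matrix can be fed to a Naive Bayes classifier as usual.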

Continue Reading

No dataset this week

I just want to let you guys know that, since it’s my birthday, I’m taking a quick break from analyzing a dataset this week.  I’ll have another one ready for you next week.

If you guys have any questions for me, feel free to leave a comment down below or contact me.

Happy coding!

Jupyter Notebooks are now on GitHub

I just want to let you guys know that, with the exception of my first dataset post, all past and future dataset analyses will be uploaded to GitHub.  I decided to do this in case people want to expand upon what I did.  For those who do, please credit me in your report.

As usual, I’ll still be blogging about the datasets I analyze and the lessons I learn.  The GitHub repository can be found here.

Dataset: Human Resources Analysis (Kaggle)

This week’s dataset is Kaggle’s Human Resources Analysis.  The question that the dataset asks is:

Why are our best and most experienced employees leaving prematurely?

I then asked the following question:

How well can we predict whether an employee is going to leave?

It’s definitely possible to answer the second question with great accuracy.  I used a decision tree because the features form a non-linear relationship.  I have a picture of the tree, but it’s far too big to upload to this post. Continue Reading
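Under the hood, a decision tree handles non-linear relationships by repeatedly picking the split that most reduces node impurity.  A minimal sketch of the Gini impurity measure that drives those splits (the ‘left’/‘stayed’ labels below are made-up toy data, not the Kaggle dataset):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0.0; a 50/50 mixed node has impurity 0.5.
print(gini(["left", "left", "stayed", "stayed"]))  # 0.5
print(gini(["left", "left", "left"]))              # 0.0
```

At each node, the tree evaluates candidate thresholds on each feature and keeps the split whose child nodes have the lowest weighted impurity, which is what lets it carve out non-linear decision boundaries.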