Popular Kaggle Kernels dataset

With Data Science being a very popular field that people want to get into, it's no surprise that the number of contributions to Kaggle has increased dramatically.  I recently stumbled across a dataset that gathered the most popular kernels and decided to do some exploratory data analysis on it.

Read more


Kaggle's Digit Recognizer dataset

One of the hottest disciplines in the tech industry in 2017 was Deep Learning.  Because of it, many startups put an emphasis on AI, and many frameworks have been developed to make implementing these algorithms easier.  Google's DeepMind was even able to create AlphaGo Zero, which mastered the game of Go without relying on human game data.  However, this analysis is much more basic than anything that was recently developed.  In fact, the dataset is the popular MNIST database.  In other words, the dataset consists of handwritten digits used to test out computer vision.

Read more


Dataset: Paradis Bilingual Corpus

It's been a while since I last did an analysis on a dataset.  Today's dataset will focus on a corpus that deals with children learning English as a second language.  The study was done by Johanne Paradis from the University of Alberta.

Read more


Dataset: Belgium retail market

In this week's dataset, I worked with the Belgium retail market dataset.  In my previous post, I talked about how Apriori can be used to generate association rules.  So, I searched for a good dataset to which I could apply the Apriori algorithm.  The dataset consists of over 88,000 transactions with over 16,000 different items.  While the dataset only contains numbers, we can still apply the algorithm.  This analysis demonstrates how the support and confidence thresholds influence the number of rules generated.
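To make the support and confidence idea concrete, here is a minimal sketch in plain Python.  It is not the full Apriori algorithm (it only checks pairs of items and does no level-wise pruning), and the transactions and item IDs are made up for illustration rather than taken from the retail dataset.  The helper names are hypothetical as well.

```python
from itertools import combinations

# Toy transactions: the integer item IDs are invented for this example.
transactions = [
    {1, 2, 3},
    {1, 2},
    {1, 3},
    {2, 3},
    {1, 2, 3, 4},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

def pair_rules(transactions, min_support, min_confidence):
    """Return single-item rules (lhs, rhs) that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        # Skip pairs that are too rare to be interesting.
        if support({a, b}, transactions) < min_support:
            continue
        # A pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            if confidence({lhs}, {rhs}, transactions) >= min_confidence:
                rules.append((lhs, rhs))
    return rules

# Looser thresholds let more rules through; stricter ones filter them out.
print(len(pair_rules(transactions, min_support=0.1, min_confidence=0.7)))
print(len(pair_rules(transactions, min_support=0.5, min_confidence=0.7)))
print(len(pair_rules(transactions, min_support=0.5, min_confidence=0.8)))
```

Lowering either threshold can only keep the rule count the same or raise it, which is the effect the analysis looks at on the real data.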

Read more


Dataset: Anime Recommendations Database

This week's dataset is to determine the most recommended anime from a list of anime shows and user ratings.  To determine a list of recommended shows, I built a very primitive recommendation system based on two criteria:

Read more


Dataset: Mushroom Data Set

This week's dataset is about classifying the edibility of mushrooms given several attributes.  I was originally going to do a comparison between Naive Bayes and decision trees on the dataset, but scikit-learn's models don't accept string-valued features directly when training.  Additionally, I'm not yet equipped to write a decision tree algorithm from scratch.  Despite these setbacks, running Naive Bayes against this dataset yields very good results with 99% accuracy.
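One common way around the string-feature limitation is to encode each categorical value as numbers before training.  The sketch below shows one-hot encoding in plain Python so the idea is visible without any library.  The attribute values are invented for illustration and the function names are hypothetical; this is not necessarily what the original analysis did.  scikit-learn also ships encoders such as OneHotEncoder and OrdinalEncoder for the same purpose.

```python
def fit_categories(rows):
    """Collect the sorted distinct values seen in each column."""
    n_cols = len(rows[0])
    return [sorted({row[i] for row in rows}) for i in range(n_cols)]

def one_hot(row, categories):
    """Turn one row of strings into a flat list of 0/1 indicators."""
    encoded = []
    for value, cats in zip(row, categories):
        # One indicator per known category in this column.
        encoded.extend(1 if value == c else 0 for c in cats)
    return encoded

# Made-up rows: (cap shape, cap color, odor).
rows = [
    ("convex", "brown", "none"),
    ("bell", "white", "almond"),
    ("convex", "white", "none"),
]

categories = fit_categories(rows)
encoded = [one_hot(row, categories) for row in rows]

for row, vec in zip(rows, encoded):
    print(row, vec)
```

Each column expands into as many 0/1 features as it has distinct values, so the resulting numeric rows can be passed to a model that expects numbers.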

Read more


No dataset this week

I just want to inform you guys that since it's my birthday, I would like to take a quick rest from analyzing a dataset.  However, I will prepare another one for you guys next week.

If you guys have any questions for me, feel free to leave a comment down below or contact me.

Happy coding!


Dataset: Iris Flower dataset

This week's dataset will be on one of the most well known datasets used in machine learning.  Introduced in 1936 by Ronald Fisher, the iris dataset is used to test out the accuracy of machine learning algorithms.

Read more


Jupyter Notebooks are now on GitHub

I just want to let you guys know that, with the exception of my first dataset post, all past and future dataset analyses will be uploaded onto GitHub.  I decided to do this in case people want to expand upon what I did.  For those who do, please credit me in the report.

As usual, I'll still be blogging about the datasets that I analyze and the lessons that I learned.  The GitHub repository can be found here.


Dataset: Human Resources Analysis (Kaggle)

This week's dataset is on Kaggle's Human Resources Analysis.  The question that the dataset asks is:

Why are our best and most experienced employees leaving prematurely?

I then asked the following question:

How well can we predict whether an employee is going to leave?

It's definitely possible to answer the second one with great accuracy.  I used a decision tree because the features have a non-linear relationship with whether an employee leaves.  I have a picture of the tree, but it's way too big to upload onto this post.

Read more