Overview of Apache Spark

For those wanting to work with Big Data, it isn't enough to simply know a programming language and a small-scale library.  Once your data reaches many gigabytes, if not terabytes, in size, working with it becomes cumbersome.  Your computer can only run so fast and store so much.  At this point, you would look into the tooling used for massive amounts of data.  One of the tools you would consider is Apache Spark.  In this post, we'll look at what Spark is, what we can do with it, and why to use it.


The Interview Attendance Problem - Data Cleaning

One of the datasets that I recently picked up was a Kaggle dataset called "The Interview Attendance Problem".  This dataset focuses on job candidates in India attending interviews for several different companies across a few different industries.  The objective is to determine whether a job candidate is likely to show up.


Popular Kaggle Kernels dataset

With Data Science being a very popular field that people want to get into, it's no surprise that the number of contributions to Kaggle has increased dramatically.  I recently stumbled across a dataset of the most popular kernels and decided to do some exploratory data analysis on it.


Kaggle's Digit Recognizer dataset

One of the hottest disciplines in the tech industry in 2017 was Deep Learning.  Because of it, many startups put an emphasis on AI, and many frameworks have been developed to make implementing these algorithms easier.  Google's DeepMind was even able to create AlphaGo Zero, which didn't rely on human game data to master the game of Go.  However, the analysis in this post is much more basic than any of these recent developments.  In fact, the dataset is the popular MNIST database.  In other words, it consists of handwritten digits used to test computer vision algorithms.


How I got into the tech industry

I remember getting a Nintendo 64 as my first game console when I was about 5 or 6 years old.  My favorite game on the system was Super Smash Brothers.  I liked the game so much that, ludicrous as it sounds, I literally visualized myself in the game.  It was at that point that I wanted to become a game developer.


Coursera now offers Deep Learning

For those interested in machine learning, Dr. Andrew Ng recently launched a new Coursera specialization called Deep Learning.  Be prepared: you will need some Python experience.


Dataset: Paradis Bilingual Corpus

It's been a while since I last did an analysis on a dataset.  Today's dataset will focus on a corpus that deals with children learning English as a second language.  The study was done by Johanne Paradis from the University of Alberta.


K-means Clustering Algorithm

In my previous post, I mentioned that many different algorithms can be used to cluster a dataset.  One of the most popular is the k-means clustering algorithm.


Introduction to Cluster Analysis

If you go online and start shopping, chances are you'll be bombarded with suggestions from the sites you visit.  These suggestions aren't random; they are based on what you recently browsed and purchased.  How do these sites determine what to recommend and what to ignore?

The system described above is called a recommendation system.  Its actual implementation, though, relies on a method called clustering, which is itself part of Cluster Analysis.


Algorithm: Bernoulli Naive Bayes

In my post on Naive Bayes, I mentioned that there are multiple variants that can be applied to different problems.  In this post, I will introduce another variant of Naive Bayes, one that uses the Bernoulli distribution.
