Dataset: Paradis Bilingual Corpus

It's been a while since I last analyzed a dataset.  Today's dataset is a corpus of children learning English as a second language.  The study was conducted by Johanne Paradis of the University of Alberta.

Read more


K-means Clustering Algorithm

In my previous post, I mentioned that many different algorithms can be used to cluster a dataset.  One of the most popular is the k-means clustering algorithm.

Read more


Introduction to Cluster Analysis

If you go online and start shopping, chances are you'll be bombarded with suggestions from the sites you visit.  These suggestions aren't random; they are based on what you recently browsed and purchased.  How do the sites determine what to recommend and what to ignore?

The system described above is called a recommendation system.  The actual implementation, though, relies on a method called clustering.  Clustering is itself part of cluster analysis.

Read more


Algorithm: Bernoulli Naive Bayes

In my post on Naive Bayes, I mentioned that there are multiple variants suited to different problems.  In this post, I will introduce another variant of Naive Bayes, one that uses the Bernoulli distribution.

Read more


Introduction to Support Vector Machines

So far, I have mainly discussed classification algorithms that use probabilities to make decisions.  However, there are algorithms that don't require computing probabilities.  One such algorithm is the support vector machine.

Read more


Dataset: Belgium retail market

This week, I worked with the Belgium retail market dataset.  In my previous post, I talked about how Apriori can be used to generate association rules, so I searched for a good dataset to apply the Apriori algorithm to.  The dataset consists of over 88,000 transactions with over 16,000 different items.  While the items are identified only by numbers, we can still apply the algorithm.  This analysis demonstrates how the support and confidence thresholds influence the number of rules generated.
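To make the effect of the two thresholds concrete, here is a minimal sketch in plain Python.  It is not the analysis from the post: the transactions are a made-up toy example, the function name `count_rules` is hypothetical, and only rules between single items are counted rather than running the full Apriori search.

```python
from itertools import permutations

# Hypothetical toy transactions; items are plain numbers, as in the dataset.
transactions = [
    {1, 2, 3},
    {1, 2},
    {1, 2},
    {1, 3},
    {2},
    {1},
]

def occurrences(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def count_rules(transactions, min_support, min_confidence):
    """Count single-item rules {a} -> {b} that pass both thresholds."""
    items = sorted(set().union(*transactions))
    total = len(transactions)
    count = 0
    for a, b in permutations(items, 2):
        both = occurrences({a, b}, transactions)
        # Support: share of all transactions containing both a and b.
        if both / total < min_support:
            continue
        # Confidence: share of transactions with a that also contain b.
        if both / occurrences({a}, transactions) >= min_confidence:
            count += 1
    return count

for min_support, min_confidence in [(0.1, 0.5), (0.3, 0.5), (0.3, 0.7), (0.5, 0.7)]:
    print(min_support, min_confidence,
          count_rules(transactions, min_support, min_confidence))
```

Raising either threshold can only keep the rule count the same or shrink it, which is the behavior the analysis explores on the real data.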

Read more


Algorithm: Apriori

So far, I've talked about regression and classification algorithms that can be used to solve problems.  Sometimes, though, we just want to discover associations within our data.  These associations can, in turn, be used by a business to optimize profits.

One of the fundamental algorithms for solving this kind of problem is the Apriori algorithm.

Read more


Dataset: Anime Recommendations Database

This week's goal is to determine the most recommended anime from a list of anime shows and user ratings.  To produce a list of recommended shows, I built a very primitive recommendation system based on two criteria:

Read more


Algorithm: ID3

In my decision tree post, I mentioned several different algorithms that can be used to create a decision tree.  Today, I'll be talking about one of them, the Iterative Dichotomiser 3 (ID3) algorithm.

Read more


Dataset: Mushroom Data Set

This week's task is classifying the edibility of mushrooms given several attributes.  I was originally going to compare Naive Bayes and decision trees on the dataset, but scikit-learn's models expect numeric features and won't train directly on string attributes.  Additionally, I'm not yet equipped to write a decision tree algorithm from scratch.  Despite these setbacks, running Naive Bayes against this dataset yields very good results, with 99% accuracy.
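One common way around the string restriction is to one-hot encode each categorical attribute into 0/1 columns before training.  The sketch below shows the idea in plain Python.  It is only an illustration, not necessarily what the post does: the helper `one_hot_encode` is hypothetical, and the attribute values are made up rather than taken from the mushroom data.

```python
def one_hot_encode(rows):
    """Turn rows of categorical strings into rows of 0/1 integers."""
    n_columns = len(rows[0])
    # Collect the sorted distinct values seen in each column.
    categories = [sorted({row[i] for row in rows}) for i in range(n_columns)]
    encoded = []
    for row in rows:
        vector = []
        for value, column_values in zip(row, categories):
            # One 0/1 slot per possible value of this column.
            vector.extend(1 if value == candidate else 0
                          for candidate in column_values)
        encoded.append(vector)
    return categories, encoded

# Hypothetical rows: (cap shape, cap color).
rows = [
    ["convex", "brown"],
    ["bell", "yellow"],
    ["convex", "white"],
]
categories, encoded = one_hot_encode(rows)
print(categories)
print(encoded)
```

Each row becomes a fixed-length numeric vector, which is the kind of input a numeric-only model can train on.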

Read more