With Data Science being such a popular field to get into, it’s no surprise that the number of contributions to Kaggle has increased dramatically.  I recently stumbled across a dataset of the most popular kernels and decided to do some exploratory data analysis on it.

What did I focus on?

When I analyzed the dataset, I asked a few questions:

  • Who has the largest number of popular kernels?
  • Who has received the most comments, votes, forks, and views?
  • What language was used most often?
  • What tags were popular?

By asking these questions, I discovered the following:

  • An employee has the largest number of popular kernels.
  • No one led in more than one category: whoever received the most comments, votes, forks, or views led in that one category only.
  • Python (unsurprisingly) was the most common language.  Surprisingly, R was used less often than Markdown.
  • As for tags, data visualization was the most popular tag.  Tutorial was the second most popular tag.
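
For readers who want to try a similar analysis, the questions above mostly come down to counting and summing.  Below is a minimal sketch using only the Python standard library.  It is not the code from my analysis: the field names (author, language, votes, comments, forks, views, tags), the comma-separated tag format, and the sample rows are all assumptions made up for illustration, so they will need adjusting to the real columns.

```python
from collections import Counter

# Made-up rows for illustration only. The field names and values are
# assumptions, not the real dataset's columns or contents.
SAMPLE_KERNELS = [
    {"author": "alice", "language": "Python", "votes": 120, "comments": 10,
     "forks": 40, "views": 9000, "tags": "data visualization,tutorial"},
    {"author": "bob", "language": "R", "votes": 80, "comments": 30,
     "forks": 15, "views": 4000, "tags": "data visualization"},
    {"author": "alice", "language": "Python", "votes": 60, "comments": 5,
     "forks": 50, "views": 12000, "tags": "tutorial,eda"},
    {"author": "carol", "language": "markdown", "votes": 200, "comments": 12,
     "forks": 8, "views": 3000, "tags": "data visualization"},
]

METRICS = ("comments", "votes", "forks", "views")


def kernels_per_author(rows):
    """Count how many kernels each author has in the list."""
    return Counter(row["author"] for row in rows)


def top_author_per_metric(rows):
    """Sum each metric per author and return the leading author for each."""
    leaders = {}
    for metric in METRICS:
        totals = Counter()
        for row in rows:
            totals[row["author"]] += row[metric]
        leaders[metric] = totals.most_common(1)[0][0]
    return leaders


def language_counts(rows):
    """Count how many kernels were written in each language."""
    return Counter(row["language"] for row in rows)


def tag_counts(rows):
    """Split the (assumed) comma-separated tag field and count each tag."""
    counts = Counter()
    for row in rows:
        counts.update(t.strip() for t in row["tags"].split(",") if t.strip())
    return counts
```

With a real CSV export, the rows could be read with csv.DictReader instead of being typed in by hand.  In that case the numeric fields arrive as strings and would need converting with int() before summing.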

Limitations

Although the dataset is interesting, its main drawback is that it is not updated in real time.  It was last updated on February 26, 2018, so the findings above may no longer reflect the current state of Kaggle.

In the real world, one important criterion is that the data be up to date.  Without recent data, it becomes very hard, if not impossible, to gain reliable insight.  For this dataset, though, that criterion is not too critical, as it just holds curiosities about Kaggle kernels.

Conclusion

For those who want to learn Data Science from other people, Kaggle is a good way to do so.  The most popular kernels have often helped people understand the first steps in doing Data Science.  My journal on the analysis can be found on GitHub and Kaggle.

Interested in the dataset?  The dataset can be found on the Kaggle site.  You will need an account to download data and submit kernels.