A Follow-Up on AutoWeber: The Mistakes I Made In Design

In my previous post, I talked about a proof of concept for a self-adapting web scraper.  As I built on the project, I had difficulty adding constraints to improve structure accuracy.  After some time, I came to one conclusion:  My Initial Design Was Flawed!



A Proof of Concept on a Self-Adapting Web Scraper

Last year, I created the IssueHunt-Statistics website project for tracking repositories, issues, and funding for open source projects.  Shortly after, however, the website changed and my project broke down.  I changed the scraping code to restore functionality, only for it to break down again a little while later.

I now have a problem.  I don't want to constantly spend time reworking the scraping code to keep it functional.  Could I automate this task?



Project: IssueHunt Statistics

To keep up with advances in technology, software engineers often contribute to Open Source.  Here, I'll restrict this to contributing to other people's existing projects, not your own.

However, there are some obstacles when contributing:

  • Since many tools used in the community are Open Source, there are very strict standards that must be followed.  Thus, the process of contributing to existing projects can be quite a headache.
  • If a project is small and the owner isn't active on a regular basis, it can be hard for your work to be merged into the project.
  • Some project communities can be toxic.  The Linux kernel community has experienced a lot of toxicity from Linus Torvalds, the Linux founder.
  • Many professional software engineers have non-compete agreements that forbid them from programming in their free time.  Those who don't often have other commitments.
  • If you're not getting paid to contribute during working hours, why bother?

Some would see not contributing to Open Source as selfish.  After all, you get to use free tools and you should be grateful.  I honestly don't like this line of thinking.  Not everyone wants to spend all of their time programming.  Some projects have contributing policies that are a hassle to deal with.  Some would rather do a side hustle and earn extra money.

Fortunately, there are a few websites that focus on earning money while contributing to Open Source.  I ran across these sites:

  • IssueHunt - I noticed that this site mainly focuses on web projects.  If you want to contribute, I recommend having a background in JavaScript and TypeScript.
  • BountySource - Has a much more active user base with more variety.
  • Gitcoin - The tasks on this site focus more on Blockchain.  You can be rewarded with Ethereum as well as cash.

For this post, I'll be mainly focusing on IssueHunt.



Overview of Apache Spark

For those wanting to work with Big Data, it isn't enough to simply know a programming language and a small-scale library.  Once your data reaches many gigabytes, if not terabytes, in size, working with it becomes cumbersome.  Your computer can only run so fast and store so much.  At this point, you would look into what kind of tooling is used for massive amounts of data.  One of the tools that you would consider is Apache Spark.  In this post, we'll look at what Spark is, what we can do with it, and why to use it.



The Interview Attendance Problem - Data Cleaning

One of the recent datasets that I picked up was a Kaggle dataset called "The Interview Attendance Problem".  This dataset focuses on job candidates in India attending interviews for several different companies across a few different industries.  The objective is to determine whether a job candidate is likely to show up or not.



Popular Kaggle Kernels dataset

With Data Science being a very popular field that people want to get into, it's no surprise that the number of contributions to Kaggle has increased dramatically.  I recently stumbled across a dataset that gathered the most popular kernels and decided to do some exploratory data analysis on it.
