Blog

Project: IssueHunt Statistics

To keep up with advances in technology, one activity that software engineers often pursue is contributing to Open Source.  I’ll restrict this discussion to contributing to existing projects owned by others, not to your own projects.

However, there are some obstacles when contributing:

  • Since many tools used in the community are Open Source, there are often very strict standards that must be followed.  As a result, the process of contributing to existing projects can be quite a headache.
  • If a project is small and the owner isn’t active on a regular basis, it can be hard for your work to be merged into the project.
  • Some project communities can be toxic.  The Linux kernel community has experienced a lot of toxicity from Linus Torvalds, the Linux founder.
  • Many professional software engineers have non-compete agreements that forbid them from programming in their free time.  Those who don’t often have other commitments.
  • If you’re not getting paid to contribute during working hours, why bother?

Some would see not contributing to Open Source as selfish.  After all, you get to use free tools, so you should be grateful.  I honestly don’t like this line of thinking.  Not everyone wants to spend all of their free time programming.  Some projects have contributing policies that are a hassle to deal with.  Some would rather take on a side hustle and earn extra money.

Fortunately, there are a couple of websites that focus on earning money while contributing to Open Source.  I ran across a few different sites:

  • IssueHunt – I noticed that this site mainly focuses on web projects.  If you want to contribute, I recommend having a background in JavaScript and TypeScript.
  • BountySource – Has a much more active user base with more variety.
  • Gitcoin – The tasks on this site focus more on blockchain projects.  You can be rewarded in Ether as well as cash.

For this post, I’ll be mainly focusing on IssueHunt.

Project Basis

As with other bounty sites, IssueHunt allows users to fund issues, submit pull requests, and get rewarded.  However, I noticed that IssueHunt provides very few statistics.  My project involves creating a website that scrapes portions of the IssueHunt website and presents the data in a meaningful way.
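
To give a rough idea of what that scraping might look like, here’s a minimal sketch in TypeScript using axios and cheerio.  The CSS selectors and field names are placeholders of my own for illustration; they are not IssueHunt’s actual markup.

import axios from 'axios';
import * as cheerio from 'cheerio';

interface RepoSummary {
    name: string;
    openIssues: number;
}

// Fetch a listing page and pull repository names and issue counts out of the HTML.
// The selectors ('.repo-card', '.repo-name', '.issue-count') are hypothetical.
async function scrapeRepos(url: string): Promise<RepoSummary[]> {
    const { data: html } = await axios.get<string>(url);
    const $ = cheerio.load(html);
    const repos: RepoSummary[] = [];
    $('.repo-card').each((_, card) => {
        repos.push({
            name: $(card).find('.repo-name').text().trim(),
            openIssues: parseInt($(card).find('.issue-count').text(), 10) || 0,
        });
    });
    return repos;
}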

Choice of Infrastructure

Since IssueHunt is a small website, I chose to keep the infrastructure rather simple.  For the front-end, I chose React as the main framework and import the react-chartjs-2 module for graphing data.  On the back-end, I chose NodeJS for the server.
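
To give a sense of the front-end, here’s a minimal sketch of a chart component using react-chartjs-2 (its older v2 API, which doesn’t require registering chart types).  The labels and values are made-up placeholders, not real site data.

import React from 'react';
import { Line } from 'react-chartjs-2';

// Placeholder data: repositories added to the site each week (values made up).
const chartData = {
    labels: ['Week 1', 'Week 2', 'Week 3', 'Week 4'],
    datasets: [
        {
            label: 'Repositories added',
            data: [3, 7, 4, 9],
            borderColor: 'rgba(75, 192, 192, 1)',
            fill: false,
        },
    ],
};

// A simple stateless component that renders the line chart.
const RepoGrowthChart = () => <Line data={chartData} />;

export default RepoGrowthChart;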

Our data is stored using MySQL 8.  I chose MySQL for two reasons.  First, our data can easily be classified as relational, which allows the database to optimize data access.  Second, since IssueHunt is currently a small bounty site, it would be overkill to use a NoSQL database.
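
For illustration, a relational layout might look something like the sketch below, written here against the mysql2 client.  The table and column names are my assumptions, not the project’s actual schema.

import mysql from 'mysql2/promise';

// Hypothetical schema: one table for repositories and one for issues, with the
// issue status stored as an enum matching the site's four statuses.
async function createTables(): Promise<void> {
    const conn = await mysql.createConnection({
        host: 'localhost',
        user: 'stats',
        database: 'issuehunt_stats',
    });
    await conn.query(`
        CREATE TABLE IF NOT EXISTS repositories (
            id INT AUTO_INCREMENT PRIMARY KEY,
            name VARCHAR(255) NOT NULL,
            scraped_at DATETIME NOT NULL
        )`);
    await conn.query(`
        CREATE TABLE IF NOT EXISTS issues (
            id INT AUTO_INCREMENT PRIMARY KEY,
            repository_id INT NOT NULL,
            status ENUM('unfunded', 'funded', 'submitted', 'rewarded') NOT NULL,
            price DECIMAL(10, 2) NOT NULL DEFAULT 0,
            FOREIGN KEY (repository_id) REFERENCES repositories (id)
        )`);
    await conn.end();
}

The foreign key ties each issue back to its repository, which is what makes the per-repository questions later in this post cheap to answer.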

What Analysis Can It Do?

The website displays some statistics about the repositories and the issues.  The site mostly uses line charts to convey the data; the only exception is the count of issues in each status.

Currently, the following general questions can be answered:

  • How many repositories are available?
  • How often are repositories added to the site?
  • What is the total amount funded on the website?
  • What is the total amount of active funding on the website?
  • How many issues are open?  – Note that an open issue is one that is either funded or unfunded (i.e., not yet submitted or rewarded).

For a given repository:

  • How many issues are unfunded, funded, submitted, and rewarded?
  • What are the mean and median prices of a funded, submitted, and rewarded issue?  (See the sketch below.)
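
Computing those per-repository statistics is straightforward once the issue prices are collected.  Here’s a minimal sketch for one status bucket:

// Mean and median of issue prices for one status bucket (e.g. all funded issues).
function mean(prices: number[]): number {
    if (prices.length === 0) return 0;
    return prices.reduce((sum, p) => sum + p, 0) / prices.length;
}

function median(prices: number[]): number {
    if (prices.length === 0) return 0;
    const sorted = [...prices].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0
        ? (sorted[mid - 1] + sorted[mid]) / 2
        : sorted[mid];
}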

Missing Features and Data

Filtering

As of this writing, there’s no way to filter repositories by name or by attributes such as:

  • Contains issues
  • List by most issues
  • List by most funded

Representing Issues in a Time Series

Issues are only displayed as of the last web scraping run.  There is currently no data for displaying how many issues of each type existed on a particular day.  Fortunately, IssueHunt does have an activity section that shows what happened to a particular issue, so this could be implemented in the future.

Additional Data Analysis

The site doesn’t yet contain straightforward data for answering the following questions:

  • Which repository contains the most issues?
  • Which repository has the most rewarded issues?
  • What is the most expensive issue funded/rewarded?
  • Which repository contains the most submitted issues?
  • Which repository contains the most unfunded issues?
  • Which repository is worked on the least?
  • How many repositories have at least one issue?
  • Who has submitted the most issues for a given repository?
  • What’s the most popular primary language used?
  • Who has been rewarded the most for a given repository?
  • Who has been rewarded the most in general?
  • Etc…

Conclusion

If you want to make a little money on the side while honing your software development skills, IssueHunt could be a viable option.  I created this project to give myself a more in-depth look at the site’s activity.  I hope you’ll find the site useful.

I’ll keep adding features in the meantime to make the site more useful.  The repository can be found on my GitHub account.

Natural Language Processing: Working With Human Readable Data

Most models in machine learning require working with numbers.  After all, many of the machine learning algorithms we’ve seen are derived from statistics (Linear Regression, Logistic Regression, Naive Bayes, etc.).  Additionally, machines can understand and work with numbers far more easily than we humans can.

However, machines just process the numbers and execute algorithms.  They don’t interpret the numbers returned.  They don’t understand the context of the data.  They especially don’t understand human intricacies and can easily be taken advantage of by rogue players.

So then, is it actually possible for computers to understand humans?  Can we ever have conversations with computers?  In a sense, we already can!  This is thanks to a branch of AI called Natural Language Processing.

Continue reading Natural Language Processing: Working With Human Readable Data

When Your Model Is Inaccurate

Let’s imagine you’re doing research on an ideal rental property.  You gather your data, open up your favorite programming environment, and get to work performing Exploratory Data Analysis (EDA).  During your EDA, you find some dirty data and clean it for training.  You decide on a model, separate the data into training, validation, and testing sets, and train your model on the cleaned data.  Upon evaluating your model with the validation and test data, you notice that both your validation error and your test error are very high.

Now suppose you pick a different model or add additional features.  Now your validation error is much lower.  Great!  However, upon using your testing data, you notice that the error is still high.  What just happened?

Continue reading When Your Model Is Inaccurate

What are Neural Networks?

I admit, I’m late to the whole Neural Network party.  With all of the major news coverage of AI systems that use neural networks as part of their implementation, you’d have to be living under a rock not to know about them.  While it’s true that they can provide more flexible models than other machine learning algorithms, they can be challenging to work with.

Continue reading What are Neural Networks?

Setting up OpenCV for Java via Maven

When you learn about OpenCV, you’ll often run into OpenCV for Python or C++, but not Java.  I can understand why: OpenCV for Python is a glorified NumPy extension, and OpenCV for C++ is very fast.  However, you may have a legitimate need to use Java instead of Python or C++.

In a professional setting, Java users are likely to use Apache Maven so that everyone gets the same version of each dependency without causing build and run issues.  Sure, you can always install the library and set up the CLASSPATH to point at OpenCV, but I find it better to let Maven handle the libraries.  Just note that there is no official Maven repository for OpenCV at the time of writing, but others have uploaded alternative repositories.

Repository for OpenCV 2

For those using OpenCV 2 with Java, you’d want to use the nu.pattern repository.  Here is the dependency declaration needed to import OpenCV:

<!-- https://mvnrepository.com/artifact/nu.pattern/opencv -->
<dependency>
    <groupId>nu.pattern</groupId>
    <artifactId>opencv</artifactId>
    <version>2.4.9-4</version>
</dependency>

Repository for OpenCV 3

For those needing to use OpenCV 3, the repository will be different.  There is no nu.pattern equivalent version for OpenCV 3.  You will need to use the following repository instead:

<!-- https://mvnrepository.com/artifact/org.openpnp/opencv -->
<dependency>
    <groupId>org.openpnp</groupId>
    <artifactId>opencv</artifactId>
    <version>3.4.2-0</version>
</dependency>

Repository for OpenCV 4

Yes, OpenCV 4 will be released soon.  However, don’t expect a Maven repository for it just yet.  If you’re interested in installing an early release of OpenCV 4 for Python, Adrian Rosebrock has posted instructions for Mac OS X and Ubuntu users.

Loading the OpenCV Library

After adding the dependency to your Maven file, you need to load the library before using it.  Normally, you would use the line:

System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

However, this method won’t work here, as it relies on the OpenCV native libraries actually being installed on the system.  Instead, you need to do the following:

nu.pattern.OpenCV.loadShared();  // Loads the bundled native library via the shared class loader
nu.pattern.OpenCV.loadLocally(); // Use in case loadShared() doesn't work

Once you call one of these methods, you should be able to use OpenCV normally.  OpenCV for Java is akin to OpenCV for C++, so you should be able to transfer much of your knowledge between the two languages.

Sport Recommendation Exercise

Sports.  Sports.  Sports.

Some people love watching them.  Others love playing them.  The US loves its football, while those in Latin America love their soccer.  As much as we fight and bicker about which sport is the best or whether our favorite team is the best, many people love sports as a pastime and follow their teams religiously.

While I’m not a sports fan, I did come across an interesting dataset from data.world that determines the toughest sport to pick up.  Even though this dataset is framed in an objective manner, I would like to ask a different question: based on the sports data and a person’s abilities, what sport would be optimal for them?

Continue reading Sport Recommendation Exercise

Overview of Apache Spark

For those wanting to work with Big Data, it isn’t enough to simply know a programming language and a small-scale library.  Once your data reaches many gigabytes, if not terabytes, in size, working with it becomes cumbersome.  Your computer can only run so fast and store only so much.  At this point, you would look into what kind of tooling is used for massive amounts of data.  One of the tools you would consider is Apache Spark.  In this post, we’ll look at what Spark is, what we can do with it, and why to use it.

Continue reading Overview of Apache Spark

The Interview Attendance Problem – Data Cleaning

One of the recent datasets that I picked up was a Kaggle dataset called “The Interview Attendance Problem”.  This dataset focuses on job candidates in India attending interviews for several different companies across a few different industries.  The objective is to determine whether a job candidate is likely to show up.

Continue reading The Interview Attendance Problem – Data Cleaning

Popular Kaggle Kernels dataset

With Data Science being a very popular field that people want to get into, it’s no surprise that the number of contributions to Kaggle has dramatically increased.  I recently stumbled across a dataset that gathered the most popular kernels and decided to do some exploratory data analysis on it.

Continue reading Popular Kaggle Kernels dataset

Kaggle’s Digit Recognizer dataset

One of the hottest disciplines in the tech industry in 2017 was Deep Learning.  Because of Deep Learning, many startups placed an emphasis on AI, and many frameworks have been developed to make implementing these algorithms easier.  Google’s DeepMind was even able to create AlphaGo Zero, which mastered the game of Go without relying on human game data.  However, the analysis here is much more basic than anything that was recently developed.  In fact, the dataset is the popular MNIST database.  In other words, the dataset consists of handwritten digits used to test out computer vision.

Continue reading Kaggle’s Digit Recognizer dataset