One of the recent datasets that I picked up was a Kaggle dataset called “The Interview Attendance Problem”.  This dataset focuses on job candidates in India attending interviews for several different companies across a few different industries.  The objective is to determine whether a job candidate is likely to show up for a scheduled interview.

Upon doing an initial exploratory data analysis (EDA), I noticed that the dataset was very dirty and required quite a bit of cleaning before I could perform a meaningful EDA or train a model.  Since it took me a while to actually clean up the data, I decided to dedicate a Jupyter notebook exclusively to data cleaning.

What’s lurking in that data?

When going through the data initially, I ended up hitting inconsistent time formats, useless columns, a garbage entry, and redundant values in several columns.

Inconsistent time

When talking about inconsistent time, there were two aspects.  The first was inconsistent date separators.  The majority of dates were represented as dd.mm.YYYY, with some represented as dd-mm-YYYY.  In pandas, in order to create a datetime index, dates must be formatted consistently.  The first thing I needed to do was convert all separators from periods into dashes.
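
A minimal sketch of that step, assuming the raw file is named Interview.csv and the dates live in a column called “Date of Interview” (both names are assumptions):

```python
import pandas as pd

# Both the file name and the column name below are assumptions
df = pd.read_csv("Interview.csv")

# Normalize the date separators: periods become dashes,
# so "13.02.2016" becomes "13-02-2016"
df["Date of Interview"] = df["Date of Interview"].str.replace(".", "-", regex=False)
```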

After converting the separators, another problem was encountered when attempting to create a datetime index.  The issue was caused by some dates omitting the first two digits of the year (dd-mm-YY instead of dd-mm-YYYY).  For the conversion to work, we need all dates to be represented as dd-mm-YYYY, so the two-digit years have to be padded out.
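
Continuing from the snippet above, a hedged way to pad the two-digit years and then build the index:

```python
# Expand two-digit years, e.g. "13-02-16" -> "13-02-2016".
# This assumes every interview in the dataset took place in the 2000s.
def expand_year(date_str):
    day, month, year = date_str.split("-")
    if len(year) == 2:
        year = "20" + year
    return f"{day}-{month}-{year}"

df["Date of Interview"] = df["Date of Interview"].apply(expand_year)

# With a consistent dd-mm-YYYY format, the datetime index can be built
df["Date of Interview"] = pd.to_datetime(df["Date of Interview"], format="%d-%m-%Y")
df = df.set_index("Date of Interview")
```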

Useless columns

When reading in the file and looking at the first five entries, we notice that the last five columns of the table provide no data at all; they’re just wasted space.  Fortunately, we can easily remove these columns without losing any information.
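
One hedged way to do this, without hard-coding the column labels, is to drop every column that is entirely empty:

```python
# Remove every column whose values are all NaN
df = df.dropna(axis=1, how="all")
```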

Garbage entry

In order to determine how much cleaning was needed, I wanted to count the number of missing entries in each column, as sketched below.
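
A minimal check, assuming the DataFrame from the earlier snippets is still named df:

```python
# Count the number of missing (NaN) entries in each column
df.isnull().sum()
```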

I noticed that almost every column had at least one entry containing no data.  While it could be a coincidence, I wanted to take a look at the last three entries in the dataset.
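
Continuing the sketch from above, pulling up the tail of the DataFrame:

```python
# Inspect the last three rows to see whether one of them is entirely empty
df.tail(3)
```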

It appears that there is a garbage entry in the dataset.  Thankfully, we can just remove this one entry.
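
Continuing the same sketch, one way to drop that entry:

```python
# Drop any row that contains no data at all, which removes the
# single garbage entry at the end of the file
df = df.dropna(axis=0, how="all")
```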

Redundancy of values

Finally, we simplified and cleaned up the possible entries to make EDA and model training easier in the future.  One example column contained quite a bit of redundancy in its possible values.

Most of these entries required grouping certain values into a general representation of one value (Not yet and Yet to confirm -> Not yet), replacing missing values with a specific value (nan -> Not yet), and lowercasing entries to remove variations of the same word (No and NO -> no).
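
A minimal sketch of those three steps, assuming the column is named “Expected Attendance” (the exact label is an assumption):

```python
# The column name here is an assumption for illustration
col = "Expected Attendance"

df[col] = (
    df[col]
    .fillna("Not yet")                       # nan -> Not yet
    .str.lower()                             # No / NO -> no
    .replace({"yet to confirm": "not yet"})  # merge equivalent values
)
```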

Conclusion

Before performing any analysis on a dataset, it is very important to take a look at and clean up the data.  Doing so will not only improve the quality of your analysis, but it can even reduce the dataset’s size.  In this case, the original dataset was 230 kilobytes.  After cleaning, it was reduced to 204 kilobytes, a reduction of roughly 11%.  While that isn’t particularly impressive for a dataset this small, the savings would be more noticeable on much bigger datasets.

The notebook can be found on GitHub and Kaggle.  The dataset can also be found on Kaggle.