This week’s dataset is Kaggle’s Human Resources Analysis. The question the dataset poses is:
Why are our best and most experienced employees leaving prematurely?
I then asked the following question:
How well can we predict whether an employee is going to leave?
The second question can definitely be answered with high accuracy. I used a decision tree because the relationship between the features and whether an employee leaves is non-linear. I have a picture of the tree, but it’s way too big to upload to this post.
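To give a feel for why a tree handles a non-linear pattern well, here is a minimal sketch in plain Python. It is not the tree from this analysis: the data is synthetic, and the two features (`satisfaction`, `monthly_hours`) and the rule that decides who leaves are assumptions made up for illustration. In practice a library implementation such as scikit-learn's `DecisionTreeClassifier` would be the usual choice.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) with the lowest weighted Gini,
    or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until the node is pure or max_depth is reached."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Synthetic grid of (satisfaction, monthly_hours). Assumed rule: an employee
# leaves (1) if overworked, or if both unhappy and underused. No single
# straight line separates these two groups from the people who stay.
rows = [(s / 10, h) for s in range(1, 10) for h in range(100, 301, 20)]
labels = [1 if h > 250 or (s < 0.4 and h < 150) else 0 for s, h in rows]

tree = build_tree(rows, labels)
accuracy = sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)
print("training accuracy:", accuracy)
print(predict(tree, (0.2, 120)), predict(tree, (0.8, 200)), predict(tree, (0.8, 280)))
```

The accuracy here is measured on the same rows the tree was built from, so it only shows that a few axis-aligned splits can capture the made-up rule. A real evaluation would hold out part of the data.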
However, as I looked through the data, I realized that answering the first question would not be as straightforward. This realization is part of the challenge of doing data science and analysis. The objective of this kind of analysis is to provide a business solution to a problem, but in the real world, data is rarely straightforward or clean. There are several reasons for this:
Values can be missing for all sorts of reasons. This dataset didn’t have that issue.
Vital information that would support better decisions is absent. Our dataset suffers from this issue in several ways:
There is no column recording the reason each employee left.
We have no knowledge of the company culture.
We don’t have balance sheets to determine how well the company is doing.
Sometimes the data simply isn’t available. Company feedback and employee performance data are often proprietary, and data can also simply be lost.
The data values don’t make sense, and what they mean is either unexplained or poorly explained.
The data could be outdated. Time wasn’t really a factor in this analysis, but, generally speaking, past data doesn’t always predict future behavior (the stock market is an obvious example).
The list goes on…
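Two of the issues above, missing values and values that don’t make sense, are easy to check mechanically before any modeling. Below is a rough sketch of such a check in plain Python. The column names, the sample records, and the valid ranges are all hypothetical and not taken from the Kaggle dataset.

```python
def audit(rows, valid_ranges):
    """Count missing and out-of-range values per column.

    rows: list of dicts, one per employee record.
    valid_ranges: {column: (low, high)}, the inclusive range we expect.
    """
    report = {col: {"missing": 0, "out_of_range": 0} for col in valid_ranges}
    for row in rows:
        for col, (low, high) in valid_ranges.items():
            value = row.get(col)
            if value is None or value == "":
                report[col]["missing"] += 1
            elif not (low <= value <= high):
                report[col]["out_of_range"] += 1
    return report

# Hypothetical records; names and ranges are assumptions for illustration.
records = [
    {"satisfaction": 0.72, "monthly_hours": 160},
    {"satisfaction": None, "monthly_hours": 210},  # missing value
    {"satisfaction": 1.80, "monthly_hours": 175},  # expected to lie in 0..1
    {"satisfaction": 0.41, "monthly_hours": -5},   # negative hours make no sense
]
ranges = {"satisfaction": (0.0, 1.0), "monthly_hours": (0, 744)}  # 744 = 31 days * 24 h

report = audit(records, ranges)
for col, counts in report.items():
    print(col, counts)
```

A check like this only catches problems you can describe in advance. It can’t tell you that a column such as the reason for leaving was never collected in the first place.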
As a result, data issues can lead to incorrect conclusions and solutions, which can be very costly. In the case of this dataset, better-quality data would allow us to propose better ways to retain employees.