This week’s dataset covers popular baby names in New York City and the ethnicity recorded for each.  When working with the dataset, I asked the following question:

Given a name and gender, how well can a Naive Bayes model predict the ethnicity of a person?

Before I actually trained the model with the dataset, I had to wrangle the data by doing the following:

  • Unrolled each name into one row per occurrence, based on its count.
  • Converted all names into uppercase.
  • To remove redundant ethnicities (they were just truncated versions of existing entries), I converted them to the more complete form.
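
As a rough sketch of those three steps (not the exact code from the PDF), the wrangling could look like the following. The field names (`name`, `gender`, `ethnicity`, `count`) and the entries in `ETHNICITY_FIXES` are assumptions for illustration; check them against the labels that actually appear in the dataset.

```python
# Hypothetical mapping from truncated ethnicity labels to their full forms.
# These labels are assumptions; adjust them to match the real data.
ETHNICITY_FIXES = {
    "ASIAN AND PACI": "ASIAN AND PACIFIC ISLANDER",
    "BLACK NON HISP": "BLACK NON HISPANIC",
    "WHITE NON HISP": "WHITE NON HISPANIC",
}


def wrangle(rows):
    """Turn count-aggregated rows into one (name, gender, ethnicity) record per baby."""
    records = []
    for row in rows:
        # Normalize the name to uppercase so "Olivia" and "OLIVIA" match.
        name = row["name"].upper()
        # Replace a truncated ethnicity label with its full form, if known.
        ethnicity = ETHNICITY_FIXES.get(row["ethnicity"], row["ethnicity"])
        # Unroll: repeat the record once per counted occurrence.
        records.extend([(name, row["gender"], ethnicity)] * int(row["count"]))
    return records


# Made-up sample rows, only to show the shape of the input.
sample_rows = [
    {"name": "Olivia", "gender": "FEMALE", "ethnicity": "WHITE NON HISP", "count": 2},
    {"name": "ETHAN", "gender": "MALE", "ethnicity": "HISPANIC", "count": 1},
]
records = wrangle(sample_rows)
```

Unrolling by count means every baby contributes one training example, so common names carry proportionally more weight when the model is trained.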

After training the model on the dataset, I discovered that Naive Bayes was ineffective, with an accuracy of only 64%.  Even after removing the independence assumption, the accuracy improved only marginally.
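
For readers who want to see the idea in code, here is a minimal sketch of a categorical Naive Bayes classifier over (name, gender) features. It is an illustration under my own assumptions (Laplace smoothing with `alpha=1.0`, made-up toy records, accuracy measured on the training data), not the implementation used in the PDF.

```python
import math
from collections import Counter, defaultdict


def train(records):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    vocab = defaultdict(set)               # feature index -> distinct values seen
    for *features, label in records:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, feature_counts, vocab


def predict(model, features, alpha=1.0):
    """Return the class with the highest log-posterior under the independence assumption."""
    class_counts, feature_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(features):
            count = feature_counts[(i, label)][value]
            # Laplace-smoothed log likelihood of this feature value given the class.
            score += math.log((count + alpha) / (n + alpha * len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, records):
    """Fraction of records whose predicted label matches the true label."""
    hits = sum(predict(model, tuple(f)) == label for *f, label in records)
    return hits / len(records)


# Made-up toy records in (name, gender, ethnicity) form.
toy = [
    ("OLIVIA", "FEMALE", "WHITE NON HISPANIC"),
    ("OLIVIA", "FEMALE", "WHITE NON HISPANIC"),
    ("OLIVIA", "FEMALE", "HISPANIC"),
    ("ETHAN", "MALE", "HISPANIC"),
    ("ETHAN", "MALE", "HISPANIC"),
    ("ETHAN", "MALE", "ASIAN AND PACIFIC ISLANDER"),
]
model = train(toy)
prediction = predict(model, ("OLIVIA", "FEMALE"))
toy_accuracy = accuracy(model, toy)
```

Even this toy example hints at why accuracy is capped: when the same name and gender appear under several ethnicities, the model can only ever pick one of them, so every other occurrence counts as a miss.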

Attached is the PDF of the analysis.

For those interested in analyzing the dataset yourself, here is a direct link to the NYC OpenData dataset.  Found anything interesting in your analysis?  Share your findings in the comment section.

If you have questions or comments about the analysis, feel free to leave a comment and I’ll get back to you ASAP.

Update: There was a formatting error in the PDF.  I have uploaded a new version with this issue fixed.