It’s been a while since I last analyzed a dataset.  Today’s dataset is a corpus of children learning English as a second language.  The study was conducted by Johanne Paradis of the University of Alberta.

What is a Corpus?

For those interested in linguistics, a corpus is a collection of texts (written or transcribed speech) gathered so that language use can be analyzed statistically.  If you plan on doing Natural Language Processing (NLP), these collections are ideal to experiment with.

The Analysis

Unlike the other datasets I’ve talked about, this one doesn’t have a well-defined structure.  There is one CSV file that gives some information like the age of exposure to English, the age at recording, the amount of English learned, the primary language, and the names of the files that contain the actual dialog.

Once you go into the dialog files, there’s metadata, speech from the expert, speech from the child, speech from the mother, and actions that the subject did.  While you could store the dialog in a pandas DataFrame, you wouldn’t be able to query the data effectively.  However, you can process the text to answer several questions.
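
To make the file layout concrete, here is a minimal sketch of how a dialog file could be split into those parts.  It assumes the transcripts follow the usual CHILDES transcription conventions, where metadata lines start with "@", speaker lines start with "*" plus a speaker code such as CHI, EXP, or MOT, and annotation lines such as actions start with "%".  The function name and the sample transcript are made up for illustration, and this is not necessarily how the notebook does it.

```python
from collections import defaultdict


def split_transcript(text):
    """Group transcript lines into metadata, utterances per speaker, and annotations.

    Assumes "@" marks metadata, "*CODE:" marks a speaker line, and "%tier:"
    marks an annotation line. Continuation lines are ignored in this sketch.
    """
    metadata = []
    utterances = defaultdict(list)  # speaker code -> list of utterances
    annotations = []                # (tier name, text), e.g. ("act", "...")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            metadata.append(line)
        elif line.startswith("*") and ":" in line:
            speaker, _, speech = line[1:].partition(":")
            utterances[speaker].append(speech.strip())
        elif line.startswith("%") and ":" in line:
            tier, _, note = line[1:].partition(":")
            annotations.append((tier, note.strip()))
    return metadata, dict(utterances), annotations


# Invented sample, not taken from the corpus.
sample = """@Begin
@Languages: eng
*EXP: what do you like to do ?
*CHI: um I like to play soccer .
%act: points at ball
*MOT: we play every day .
@End"""

metadata, utterances, annotations = split_transcript(sample)
print(utterances["CHI"])
print(annotations)
```

Keeping the child’s utterances in their own list makes the counting questions below a matter of scanning plain strings.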

In this analysis, I answered the following questions:

  • How many times did a child say ums and uhs?
  • How many times did a child pause?
  • How many interruptions did a child make?
  • Were there any meaningful correlations between age and the number of interruptions?
  • Were there any meaningful correlations between age and the number of ums?
  • Were there any meaningful correlations between age and the number of pauses?
  • Were there any meaningful correlations between the amount of English learned and the number of pauses?
  • Which language groups had the fewest and the most pauses?
  • Which language groups had the fewest and the most ums?
  • Which language groups had the fewest and the most interruptions?
  • Which language groups had the fewest and the most pauses, ums, and interruptions combined?
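
The counting questions above can be sketched with a few regular expressions over the child’s utterances.  The patterns are my assumptions about how the transcripts mark things: fillers written as "um" or "uh" (possibly with a prefix like "&-"), pauses written as "(.)", "(..)", or "(...)", and interruptions written as "+/.".  Check them against the actual files before trusting any totals, especially the interruption marker, since conventions differ on whether it means interrupting or being interrupted.

```python
import re

# Assumed marker patterns; adjust them to match the real transcripts.
FILLER = re.compile(r"\b(?:um+|uh+)\b", re.IGNORECASE)  # um, umm, uh, &-uh
PAUSE = re.compile(r"\(\.+\)")                          # (.), (..), (...)
INTERRUPTION = re.compile(r"\+/\.")                     # +/.


def count_disfluencies(utterances):
    """Count fillers, pauses, and interruption markers in a list of utterances."""
    counts = {"ums": 0, "pauses": 0, "interruptions": 0}
    for utterance in utterances:
        counts["ums"] += len(FILLER.findall(utterance))
        counts["pauses"] += len(PAUSE.findall(utterance))
        counts["interruptions"] += len(INTERRUPTION.findall(utterance))
    return counts


# Invented utterances, not taken from the corpus.
child_lines = [
    "um I (.) like &-uh soccer .",
    "I want +/.",
    "the umbrella (..) is big .",
]
print(count_disfluencies(child_lines))
```

The word boundaries in the filler pattern keep words like "umbrella" from being counted as an um.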

After performing the analysis, I found that there wasn’t any meaningful correlation between the age of the child or the amount of English learned and the number of pauses, ums, and interruptions made.  This result suggests that the number of pauses, ums, and interruptions depends largely on the personality of the child.

In fact, we don’t even need to study children to see these results.  Consider some of the following behaviors seen in adults:

  • People who are outspoken are more likely to interrupt than others.
  • People who are not good at public speaking are more prone to pausing and saying ums.
  • Sometimes, in order to collect their thoughts, people pause prior to speaking.  People who are charismatic often do this.
  • However, pausing too much or for too long can be socially awkward.  It may be a sign that the speaker is uncomfortable in social settings.

So the moral of the story is that even though anyone can run a statistical analysis on any dataset, it still takes someone to interpret the results in a meaningful manner.  This is where a person with domain expertise has an advantage.

The notebook can be found on Kaggle and on the GitHub page.  For those wanting to analyze the dataset themselves, the dataset can also be found on Kaggle.

The original research page can be found on the CHILDES website.

What did you find interesting about the dataset?  Were you able to discover any interesting trends?  Feel free to share your results in the comment section down below.