This week’s dataset covers housing data from Boston, Massachusetts. The dataset is provided by UCI and is primarily geared towards regression. The main point of this analysis is to determine how the cross-validation error and testing error behave as the number of cases increases. Here is how I separated the cases (note: this split is often recommended in machine learning):

  • Training set – 60 %
  • Cross validation set – 20 %
  • Testing set – 20 %
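For readers who want to reproduce a split like this, here is a minimal sketch in Python with NumPy. It is an illustration, not the code behind this analysis: the function name `split_60_20_20`, the random seed, and the synthetic stand-in data (506 rows, matching the size of the UCI housing file) are all assumptions.

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Shuffle the rows, then cut them into 60% / 20% / 20% partitions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))        # random row order
    n_train = int(0.6 * len(X))            # size of the training set
    n_cv = int(0.2 * len(X))               # size of the cross-validation set
    train, cv, test = np.split(order, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

# Synthetic stand-in data (assumption): 506 rows and 13 predictors.
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=506)

(train_X, train_y), (cv_X, cv_y), (test_X, test_y) = split_60_20_20(X, y)
print(len(train_X), len(cv_X), len(test_X))
```

Shuffling before cutting matters when the file is ordered in some way (by town, for example), since otherwise the three sets could cover different kinds of cases. Any rows left over from rounding end up in the testing set.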

For this analysis, I used multiple linear regression.  From the results, here’s what I discovered:

  • As we got to the maximum number of test cases, there was high variance for every feature.
  • Predictions for crime rate, nitric oxide concentration, index of highway accessibility, distances to five employment centers, and proportion of blacks by town had high bias until around 450 test cases.
  • Most of the time, the Charles River dummy variable had high variance.
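The kind of error-versus-cases comparison behind these findings can be sketched as follows. This is a hedged illustration, not the original analysis: it fits ordinary least squares with `np.linalg.lstsq`, uses synthetic stand-in data, and assumes the case count is varied by growing the training subset. The helper names (`fit_linear`, `mse`, `error_curve`) and the list of sizes are hypothetical, and the training error is included only as a reference point.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(X)), X])

def fit_linear(X, y):
    """Fit multiple linear regression by ordinary least squares."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef

def mse(coef, X, y):
    """Mean squared error of the fitted model on (X, y)."""
    return float(np.mean((add_intercept(X) @ coef - y) ** 2))

def error_curve(X_train, y_train, X_cv, y_cv, X_test, y_test, sizes):
    """Refit on the first m training cases and record each error."""
    curve = []
    for m in sizes:
        coef = fit_linear(X_train[:m], y_train[:m])
        curve.append({
            "cases": m,
            "train": mse(coef, X_train[:m], y_train[:m]),
            "cv": mse(coef, X_cv, y_cv),
            "test": mse(coef, X_test, y_test),
        })
    return curve

# Synthetic stand-in data (assumption), already in random order.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=506)

# 60 / 20 / 20 split by position.
X_train, y_train = X[:303], y[:303]
X_cv, y_cv = X[303:404], y[303:404]
X_test, y_test = X[404:], y[404:]

sizes = [25, 50, 100, 200, 303]  # hypothetical case counts
curve = error_curve(X_train, y_train, X_cv, y_cv, X_test, y_test, sizes)
for row in curve:
    print(f"{row['cases']:>4}  train={row['train']:.3f}  "
          f"cv={row['cv']:.3f}  test={row['test']:.3f}")
```

A common way to read such a curve: when the training and cross-validation errors are both high and close together, the model is likely underfitting (high bias); when the cross-validation error stays well above the training error, it is likely overfitting (high variance).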

Attached is a PDF showing the differences in error.

For those interested in analyzing the dataset yourself, here is a direct link to the UCI dataset and a link to the dataset I actually worked with for this analysis.  Found anything interesting in your analysis?  Share your findings in the comment section.

If you have questions or comments about the analysis, feel free to leave a comment and I’ll get back to you ASAP.