Let's imagine you're doing research on an ideal rental property. You gather your data, open up your favorite programming environment, and get to work on Exploratory Data Analysis (EDA). During your EDA, you find some dirty data and clean it so it can be used for training. You decide on a model, split the data into training, validation, and test sets, and train your model on the cleaned training data. When you evaluate the model on the validation and test sets, you notice that both the validation error and the test error are very high.
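To make that workflow concrete, here is a minimal sketch of the split, train, and evaluate steps. It uses only the Python standard library. Everything in it is an assumption made for illustration: the rent-versus-square-footage data is synthetic, the helper names (`three_way_split`, `fit_line`, `mse`) are hypothetical, and a straight-line fit stands in for whatever model you would actually choose.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of rows and split it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_line(points):
    """Fit y = a * x + b to (x, y) pairs with ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


# Synthetic, made-up data: monthly rent as a noisy linear function of square footage.
rng = random.Random(42)
data = [(sqft, 500 + 1.5 * sqft + rng.gauss(0, 50)) for sqft in range(400, 1400, 10)]

train, val, test = three_way_split(data)
model = fit_line(train)  # the model only ever sees the training set

print("train MSE:", mse(model, train))
print("validation MSE:", mse(model, val))
print("test MSE:", mse(model, test))
```

The validation error is the number you look at while choosing between models or features. The test error is kept aside as a final check on data the model selection process never touched.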
Now suppose you pick a different model or add more features, and your validation error drops considerably. Great! However, when you evaluate the model on your test data, you notice that the error is still high. What just happened?
by Joseph Woolf