Data inherently comes with noise, so when we fit a model we are fitting both the underlying function and whatever noise the data contains
The best model is one which approximates the general behavior of the data (train and test) with the lowest errors.
We want to model how data behaves in the real world
To simulate the testing set, we can use cross-validation, where different subsets of the training data take turns as the validation set
The model with the lowest average error across these subsets is then the best model
Training data inherently has errors, which can come from:
We are therefore not modeling just the function $f(x)$ but $f(x) + \epsilon$, where $\epsilon$ is the error (noise) term
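As a minimal sketch of this idea, assume the noise is additive and Gaussian, so each observation is the true value plus a random error. Both the noise distribution and the names `true_f` and `observe` are assumptions made for illustration, not part of these notes.

```python
import random

def true_f(x):
    # Hypothetical underlying function; in practice it is unknown.
    return 2.0 * x + 1.0

def observe(x, noise_std=0.5, rng=random):
    # What we actually record: the true value plus a random error term.
    return true_f(x) + rng.gauss(0.0, noise_std)

rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [i / 100 for i in range(1000)]
ys = [observe(x, rng=rng) for x in xs]

# Individual observations are off by varying amounts, but with
# zero-mean noise the errors roughly cancel out on average.
residuals = [y - true_f(x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)
print(f"mean residual: {mean_residual:.3f}")
```

A model trained on `ys` only ever sees the noisy values, which is why a very flexible model can end up fitting the noise along with the trend.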
When we build a model from the training data, we can test it on a testing set

To fit the training set more closely, we may be tempted to use a higher-order polynomial

However, the predictions on the test set can then be far off: the higher-order polynomial has fit the noise in the training set rather than the general trend (overfitting)
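The effect can be reproduced with a small sketch. It compares a straight-line fit with a degree-7 polynomial that passes through all 8 training points (built here by Lagrange interpolation, a stand-in chosen to keep the example dependency-free). The data, the hand-picked noise values, and all function names are made up for illustration.

```python
def true_f(x):
    # Hypothetical underlying trend (unknown in practice).
    return 2.0 * x + 1.0

def fit_line(xs, ys):
    # Ordinary least-squares straight line; returns a predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_interpolant(xs, ys):
    # Polynomial of degree len(xs) - 1 through every training point.
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    # Mean squared error of a predictor on a data set.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training data: the trend plus hand-picked noise of +/-0.5.
x_train = [float(i) for i in range(8)]
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(8)]
y_train = [true_f(x) + e for x, e in zip(x_train, noise)]

# Test points the models never saw (noise-free here for simplicity).
x_test = [0.5, 6.5]
y_test = [true_f(x) for x in x_test]

line = fit_line(x_train, y_train)
poly = fit_interpolant(x_train, y_train)

line_train, line_test = mse(line, x_train, y_train), mse(line, x_test, y_test)
poly_train, poly_test = mse(poly, x_train, y_train), mse(poly, x_test, y_test)
print(f"line:     train={line_train:.3f}  test={line_test:.3f}")
print(f"degree 7: train={poly_train:.3f}  test={poly_test:.3f}")
```

The degree-7 polynomial reproduces every training point, so its training error is zero, yet on this data its test error is far larger than the straight line's. Low training error alone says little about how well a model generalizes.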
<aside> 💡 Our goal is always to generalize: the function we learn should apply to the real world, not just to the data we collected. The data we collect for training and testing should therefore be representative of the real world (IID: independent and identically distributed)
</aside>
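Since we cannot touch the testing set while building the model, we use cross-validation to estimate how the model will perform on it
In cross-validation, we split the training data into multiple folds; in each run we hold out one fold as the validation set and train on the remaining folds
For example, with 4 folds of data, each row below is one run and the numbers are fold indices: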
| Train | Train | Train | Validation |
|---|---|---|---|
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 1 |
| 3 | 4 | 1 | 2 |
| 4 | 1 | 2 | 3 |
We then collect the validation error from each run and average them to get a single cross-validation error
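The procedure above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`kfold_splits`, `cross_val_error`), the interleaved fold assignment, the two candidate models, and the data are all assumptions made for the example.

```python
def kfold_splits(n, k):
    # Yield (train_indices, validation_indices) for each of the k runs.
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, folds[held_out]

def fit_mean(xs, ys):
    # Candidate 1: always predict the mean of the training targets.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Candidate 2: ordinary least-squares straight line.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: slope * (x - mx) + my

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_error(fit, xs, ys, k=4):
    # Train on k-1 folds, validate on the held-out fold, average over runs.
    errors = []
    for train_idx, val_idx in kfold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(mse(model, [xs[i] for i in val_idx],
                          [ys[i] for i in val_idx]))
    return sum(errors) / len(errors)

# Made-up data: a linear trend plus small hand-picked noise.
xs = [float(i) for i in range(8)]
noise = [0.3, -0.4, 0.1, 0.5, -0.2, -0.5, 0.4, -0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

cv_errors = {"mean": cross_val_error(fit_mean, xs, ys),
             "line": cross_val_error(fit_line, xs, ys)}
best = min(cv_errors, key=cv_errors.get)
print(cv_errors, "-> best:", best)
```

Every data point is used for validation exactly once, and the candidate with the lowest average validation error is selected. In practice the data is often shuffled before being split into folds; that step is left out here to keep the run deterministic.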
<aside> 📌 SUMMARY: The best model is the one that generalizes well to real-world data. Since we cannot measure that directly, we use cross-validation to obtain an estimate of the generalization error
</aside>