Date: May 22, 2024

Topic: Choosing the appropriate model

Recall

Notes

Data inherently comes with noise, so what we fit is the underlying function together with whatever noise is present in the observations

The best model is the one that approximates the general behavior of the data (both train and test) with the lowest error.

We want to model how data behaves in the real world

To simulate the testing set, we can use cross-validation, where different subsets of the training data take turns serving as the validation set

The model with the lowest accumulated error across these subsets would then be the best model

Handling Errors (Noise Sources)

Training data inherently has errors, which can come from:

We are not just modeling the function $f(x)$ but actually $f(x) + \epsilon$, where $\epsilon$ is the noise in the observations

Evaluating on the training and testing set

When we build a model from the training data, we can test it on a testing set


To fit the training set better, we may want to use a higher-order polynomial, which can pass closer to every training point

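However, looking at the predictions on the test set, we see that the higher-order polynomial makes erratic predictions that fall far from the actual values: it has fit the noise in the training data rather than the general trend (overfitting)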
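A minimal sketch of this effect, assuming NumPy is available. The sine target, the noise level, and the sample sizes are made-up values for illustration and do not come from the lecture. It fits polynomials of increasing degree to a small noisy training set and reports the mean squared error (MSE) on both sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical underlying function, chosen only for illustration
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger noisy testing set (assumed sizes)
x_train = rng.uniform(0, 1, 10)
y_train = true_fn(x_train) + rng.normal(0, 0.2, 10)
x_test = rng.uniform(0, 1, 50)
y_test = true_fn(x_test) + rng.normal(0, 0.2, 50)

def fit_errors(degree):
    """Fit a polynomial of the given degree on the training set and
    return (training MSE, testing MSE)."""
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 9):
    train_mse, test_mse = fit_errors(degree)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The training error cannot get worse as the degree grows, because a higher-degree polynomial contains the lower-degree ones as special cases. The testing error has no such guarantee, which is why it is the number to watch.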

<aside> 💡 Our goal is always to generalize our function so that it is applicable to the real world! The data we collect for training and testing should be representative of the real world (IID: independent and identically distributed)

</aside>

Since we cannot access the data in the testing set while choosing a model, we use cross-validation to estimate how the model would perform on it

Cross Validation

In cross-validation, we split the training data into multiple folds. In each run, we hold out one fold as a validation set and train on the remaining folds

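For example, if we have 4 folds of data, each row below is one run:

| Train | Train | Train | Validation |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 2 | 3 | 4 | 1 |
| 3 | 4 | 1 | 2 |
| 4 | 1 | 2 | 3 |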

Then we record the validation error from each run and average the errors over all runs
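The fold rotation and averaging can be sketched as follows. This is a minimal illustration assuming NumPy; the names `k_fold_indices` and `cross_val_error`, the polynomial model, and the use of mean squared error are hypothetical choices for the example, not the lecture's implementation. Indices are shuffled first so that each fold is a random subset of the training data.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    shuffled = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(shuffled, k)

def cross_val_error(x, y, degree, k=4, seed=0):
    """Average validation MSE of a polynomial of the given degree."""
    folds = k_fold_indices(len(x), k, seed)
    errors = []
    for i in range(k):
        # Fold i is held out for validation; the other folds are for training
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], degree)
        errors.append(np.mean((model(x[val_idx]) - y[val_idx]) ** 2))
    # Average the per-run validation errors into a single score
    return float(np.mean(errors))
```

Every sample is used for validation exactly once and for training in the other `k - 1` runs, so the averaged score depends less on one lucky or unlucky split.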

Picking a model
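Following the earlier note that the model with the lowest accumulated cross-validation error is the best one, here is a hedged sketch of model selection. It assumes NumPy and that the candidate models are polynomials of different degrees; `cv_error`, `pick_degree`, and the sample data are hypothetical and only illustrate the idea.

```python
import numpy as np

def cv_error(x, y, degree, k=4, seed=0):
    """Average validation MSE over k folds for one polynomial degree."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i, val in enumerate(folds):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        model = np.polynomial.Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((model(x[val]) - y[val]) ** 2))
    return float(np.mean(errors))

def pick_degree(x, y, degrees, k=4):
    """Return the degree with the lowest cross-validation error,
    together with the score of every candidate."""
    scores = {d: cv_error(x, y, d, k) for d in degrees}
    best = min(scores, key=scores.get)
    return best, scores

# Made-up noisy quadratic data, used only to exercise the functions
rng = np.random.default_rng(1)
x_demo = rng.uniform(-1, 1, 40)
y_demo = 1 + 2 * x_demo - 3 * x_demo ** 2 + rng.normal(0, 0.1, 40)
best_degree, all_scores = pick_degree(x_demo, y_demo, degrees=range(0, 8))
print("selected degree:", best_degree)
```

All candidates are scored on the same folds (same `seed`), so differences in the scores come from the models rather than from the split.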


<aside> 📌 SUMMARY: The best model for us should generalize well to real-world data. To simulate this, we use cross-validation to get an estimate of the generalization error

</aside>