Data Cleaning

Understanding Missing Data

Missing Completely at Random (MCAR): The probability that an observation is missing is unrelated to any data, observed or unobserved
Missing at Random (MAR): The probability that an observation is missing depends only on observed data features
Missing Not at Random (MNAR): The probability that an observation is missing depends on unobserved data, such as the outcome itself

Example: we run a survey to find out how affected the population is by depression

Missing Completely at Random: Someone forgot to fill out the survey
Missing at Random: Statistics show that males are less likely to complete the survey (missingness depends on an observed feature of the group, e.g., gender)
Missing Not at Random: Someone did not fill out the survey because they have depression (missingness depends on the outcome being measured)
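
The three mechanisms can be simulated on toy survey data to see how each one biases what we observe. This is a minimal sketch: the function name `apply_missingness`, the record layout, and all probabilities are assumptions made up for illustration, not values from the survey example.

```python
import random

def apply_missingness(records, mechanism, rng):
    """Return a copy of records with some 'score' values replaced by None.

    records: list of dicts with 'gender' (always observed) and 'score' (the outcome).
    mechanism: 'MCAR', 'MAR', or 'MNAR'.
    All probabilities below are made-up illustration values.
    """
    out = []
    for r in records:
        if mechanism == "MCAR":
            p = 0.2                                   # same chance for everyone
        elif mechanism == "MAR":
            p = 0.4 if r["gender"] == "M" else 0.1    # depends on an observed feature
        elif mechanism == "MNAR":
            p = 0.5 if r["score"] >= 10 else 0.05     # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        score = None if rng.random() < p else r["score"]
        out.append({"gender": r["gender"], "score": score})
    return out

rng = random.Random(0)
records = [{"gender": rng.choice("MF"), "score": rng.randint(0, 20)} for _ in range(1000)]
for mech in ("MCAR", "MAR", "MNAR"):
    masked = apply_missingness(records, mech, rng)
    observed = [r["score"] for r in masked if r["score"] is not None]
    print(mech, "observed mean:", round(sum(observed) / len(observed), 2))
```

Under the MNAR setting, high scores are dropped more often, so the observed mean underestimates the true mean. In this toy setup gender and score are generated independently, so MCAR and MAR leave the mean roughly unchanged.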

We can remove observations with missing data, but this may discard valuable information; alternatively, we can impute the missing values using a statistical method (e.g., the mean or median of the observed values)
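
A minimal sketch of the two options, using only the standard library. The `ages` list and the helper `impute` are hypothetical, and mean/median are just two of many possible imputation statistics.

```python
from statistics import mean, median

ages = [23, None, 31, 45, None, 38]   # None marks a missing value (made-up data)

# Option 1: drop incomplete observations (simple, but discards data).
dropped = [a for a in ages if a is not None]

# Option 2: impute with a summary statistic of the observed values.
def impute(values, statistic=mean):
    observed = [v for v in values if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

mean_imputed = impute(ages)             # missing entries become 34.25
median_imputed = impute(ages, median)   # missing entries become 34.5
```

Dropping keeps only four of the six observations, while imputation keeps all six at the cost of artificially reducing the variance of the column.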

Transform the data using methods suited to its data type

Highly dependent on input format

Index: (apple, orange, pear) → (0, 1, 2)
Bag of words
Term frequency $\times$ inverse document frequency (TF-IDF)
- L1 normalization of rows of a matrix
Text embeddings
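
The first three encodings above can be sketched in plain Python on a toy corpus. Assumptions: whitespace tokenization, term frequency computed as the L1-normalized row of the count matrix (one reading of the L1-normalization note), and the unsmoothed IDF $\log(N / \mathrm{df})$. Libraries often use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

docs = ["apple orange apple", "orange pear", "pear pear apple"]  # toy corpus
tokens = [d.split() for d in docs]

# Index encoding: map each distinct term to an integer id.
vocab = {term: i for i, term in enumerate(sorted({t for doc in tokens for t in doc}))}
# vocab == {"apple": 0, "orange": 1, "pear": 2}

# Bag of words: one row per document, one count per vocabulary term.
def bag_of_words(doc):
    counts = Counter(doc)
    return [counts[term] for term in vocab]

bow = [bag_of_words(doc) for doc in tokens]

# Term frequency: L1-normalize each row so that it sums to 1.
tf = [[c / sum(row) for c in row] for row in bow]

# Inverse document frequency: log(N / number of documents containing the term).
n_docs = len(tokens)
idf = [math.log(n_docs / sum(1 for row in bow if row[j] > 0)) for j in range(len(vocab))]

# TF-IDF: elementwise product of term frequency and inverse document frequency.
tfidf = [[tf_ij * idf[j] for j, tf_ij in enumerate(row)] for row in tf]
```

Text embeddings are left out here because they require a trained model rather than a counting rule.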

Proper data pre-processing allows for faster convergence
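
One common pre-processing step is standardization, which rescales each feature to zero mean and unit variance so that features on very different scales contribute comparably. The notes do not name a specific technique, so this is an assumed example; `standardize` and the `incomes` values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]   # constant column: nothing to scale
    return [(x - mu) / sigma for x in column]

incomes = [20_000, 35_000, 50_000, 65_000, 80_000]   # made-up values
z = standardize(incomes)
```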

Naive approach: fill in missing values with the values of the nearest neighbor
Can also use colorization, which leverages in-painting to recover the missing values
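
A brute-force sketch of the naive nearest-neighbor fill on a small 2D grid. The function `fill_nearest`, the use of `None` for missing cells, and the choice of squared Euclidean distance are assumptions for illustration. Real pipelines would typically use a vectorized or tree-based search.

```python
def fill_nearest(grid):
    """Fill None cells with the value of the closest known cell (squared Euclidean distance)."""
    known = [(r, c, v) for r, row in enumerate(grid) for c, v in enumerate(row) if v is not None]
    if not known:
        raise ValueError("grid has no known values to copy from")
    filled = []
    for r, row in enumerate(grid):
        new_row = []
        for c, v in enumerate(row):
            if v is None:
                # min() keeps the first of equally close candidates (row-major order).
                _, _, v = min(known, key=lambda k: (k[0] - r) ** 2 + (k[1] - c) ** 2)
            new_row.append(v)
        filled.append(new_row)
    return filled

image = [
    [1.0, None, 3.0],
    [None, None, 3.5],
    [2.0, 2.5, None],
]
filled = fill_nearest(image)
# filled == [[1.0, 1.0, 3.0], [1.0, 3.5, 3.5], [2.0, 2.5, 3.5]]
```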

In images, using inverse depth helps to improve numerical stability and provides a Gaussian error distribution
In the chart, training without normalization collapses and the mean depth goes to 0, whereas with normalization we can train for a long time without collapse
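
As a rough illustration of why inverse depth is numerically convenient: distant points, whose raw depth can be arbitrarily large, map to values close to 0, so the transformed range stays bounded. The helper `to_inverse_depth` and its `eps` guard are assumptions, not part of the method shown in the chart.

```python
def to_inverse_depth(depth_m, eps=1e-6):
    """Map a depth in metres to inverse depth (1 / depth); eps guards against division by zero."""
    return 1.0 / max(depth_m, eps)

depths = [0.5, 1.0, 10.0, 100.0, 1000.0]   # made-up depths in metres
inverse = [to_inverse_depth(d) for d in depths]
# A 2000x spread in depth (0.5 to 1000) becomes values between 0.001 and 2.0.
```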

<aside> 📌 SUMMARY: With proper data cleaning, transformation, and pre-processing, we get much better results during model training

</aside>

<aside> 📌 SUMMARY: We should maintain fairness

</aside>