Date: September 12, 2025
Topic: Data Cleaning
Recall
Notes
Data Cleaning
- Data Cleaning Steps: Clean → Transform → Pre-process
Understanding Missing Data
- Missing Completely at Random (MCAR): The probability that an observation is missing is the same for every observation and is unrelated to any data, observed or unobserved
- Missing at Random (MAR): The probability that an observation is missing depends only on observed data features
- Missing Not at Random (MNAR): The probability that an observation is missing depends on the unobserved value itself
Survey Example
We want to know how affected the population is by depression
- Missing Completely at Random: Someone forgot to fill out the survey
- Missing at Random: Statistics show males are less likely to complete the survey (missingness depends on an observed feature of the group, e.g., gender)
- Missing Not at Random: Someone did not fill out the survey because they have depression
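A minimal sketch of the three mechanisms on simulated survey data. Everything here is an assumption made for illustration: the function names, the score distribution, and the missingness probabilities are hypothetical and not taken from any real survey.

```python
import random


def simulate_survey(n=10000, seed=0):
    """Generate (gender, score) pairs; score is a hypothetical depression score."""
    rng = random.Random(seed)
    return [(rng.choice(["male", "female"]), rng.gauss(10.0, 3.0)) for _ in range(n)]


def apply_missingness(rows, mechanism, seed=1):
    """Replace some scores with None according to the chosen mechanism."""
    rng = random.Random(seed)
    out = []
    for gender, score in rows:
        if mechanism == "MCAR":
            p = 0.3  # same probability for everyone
        elif mechanism == "MAR":
            p = 0.5 if gender == "male" else 0.1  # depends on an observed feature
        elif mechanism == "MNAR":
            p = 0.6 if score > 12.0 else 0.1  # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((gender, None if rng.random() < p else score))
    return out


def observed_mean(rows):
    """Mean of the scores that were actually reported."""
    values = [score for _, score in rows if score is not None]
    return sum(values) / len(values)
```

In this simulation the MCAR mean stays close to the true mean, while the MNAR mean is biased downward because high scores are the ones most often missing. This is why MNAR data is the hardest case to handle.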
We can remove rows with missing data, at the risk of losing valuable information, or impute the missing values using a statistical method
Handling Missing Data
Remove Missing Data

- Easy, but we lose information that could be valuable
Impute Missing Data

- We estimate (guess) the missing values from the observed data
- Numerical Data: Mean, median, mode (most frequent), zero, constant, etc.
- Categorical Data: Hot-deck imputation, kNN, deep-learned embeddings
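A minimal sketch of the simplest options above, using only the standard library. Missing entries are assumed to be represented as `None`, and the function names are hypothetical. Hot-deck, kNN, and embedding-based imputation are not shown.

```python
from collections import Counter
from statistics import mean


def drop_missing(values):
    """Remove missing entries (easy, but discards information)."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Fill missing numerical entries with the mean of the observed ones."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Fill missing entries with the most frequent observed value (the mode)."""
    fill = Counter(drop_missing(values)).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def impute_constant(values, constant=0):
    """Fill missing entries with a fixed constant such as zero."""
    return [constant if v is None else v for v in values]
```

For example, `impute_mean([1.0, None, 3.0])` fills the gap with 2.0. Mean imputation keeps the column mean unchanged but shrinks its variance, which is one reason to prefer richer methods when the data allows it.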
Transform the data based on methods suited for the data type
Transforming Data
Highly dependent on input format
Image
Text
- Index: (apple, orange, pear) → (0, 1, 2)
- Bag of words
- Term-frequency $\times$ inverse document frequency (TF-IDF)
- L1 normalization of rows of a matrix
- Text embeddings
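A minimal sketch of the first four text transforms, assuming documents are already tokenized into lists of words. The function names are hypothetical. The IDF variant used here, $\log(N / \mathrm{df})$, is one common choice among several; libraries often add smoothing terms.

```python
import math
from collections import Counter


def build_index(tokens):
    """Map each distinct token to an integer in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def bag_of_words(doc_tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc_tokens)
    return [counts.get(term, 0) for term in vocab]


def l1_normalize(row):
    """Scale a row so that its absolute values sum to 1."""
    total = sum(abs(x) for x in row)
    return [x / total for x in row] if total else list(row)


def tf_idf(docs, vocab):
    """TF-IDF with L1-normalized term frequency and idf = log(N / df)."""
    n_docs = len(docs)
    bows = [bag_of_words(doc, vocab) for doc in docs]
    doc_freq = [sum(1 for bow in bows if bow[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[tf * w for tf, w in zip(l1_normalize(bow), idf)] for bow in bows]
```

With `vocab = build_index(["apple", "orange", "pear"])`, the index is `{"apple": 0, "orange": 1, "pear": 2}`. A term that appears in every document gets an IDF of 0 under this variant, so it contributes nothing to the TF-IDF vector.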
Proper data pre-processing allows for faster convergence
Pre-processing Data

- Depending on the type of model, we apply different kinds of pre-processing
- Helps the model to converge faster
Case Study: Depth Perception

- Image above has many “holes” where there are missing depth values
Filling in Depth (Black Pixels)

- Naive approach: use the value of the nearest valid neighbor
- Can also use colorization, which leverages in-painting to fill in the missing values
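A minimal sketch of the naive nearest-neighbor fill, assuming the depth map is a list of rows and holes are stored as 0.0 (black pixels). The function name is hypothetical. "Nearest" here means nearest in grid steps (4-connected), found with a breadth-first search that starts from all valid pixels at once. The colorization approach is not shown.

```python
from collections import deque


def fill_depth_holes(depth, hole=0.0):
    """Return a copy of `depth` where each hole takes the value of its nearest valid pixel."""
    height, width = len(depth), len(depth[0])
    filled = [row[:] for row in depth]

    # Start the search from every valid (non-hole) pixel simultaneously.
    queue = deque(
        (r, c) for r in range(height) for c in range(width) if depth[r][c] != hole
    )
    seen = set(queue)

    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and (nr, nc) not in seen:
                # First visit comes from the closest valid pixel, so copy its value.
                filled[nr][nc] = filled[r][c]
                seen.add((nr, nc))
                queue.append((nr, nc))
    return filled
```

This produces blocky regions around large holes, which is why in-painting methods that respect image edges tend to give smoother results.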
Data Transformation

- From single channel depth map → 3 channels
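The notes do not say how the three channels are produced. One simple possibility, assumed here purely for illustration, is to replicate the depth value across three channels so the map matches the input shape of networks built for RGB images. Other encodings exist, and the function name is hypothetical.

```python
def depth_to_three_channels(depth):
    """Turn an H x W depth map into an H x W x 3 map by repeating each value."""
    return [[[value, value, value] for value in row] for row in depth]
```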
Pre-processing

- In images, inverse depth helps to improve numerical stability and provides a Gaussian error distribution
- In the chart, training without normalization collapses and the mean depth goes to 0, whereas with normalization we can train for a long time without collapse
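A minimal sketch of these two pre-processing steps, assuming depth is given in metres as a flat list of pixel values. The notes do not say which normalization was used, so zero-mean, unit-variance standardization is assumed here. The function names and the `min_depth` clamp are hypothetical.

```python
from statistics import mean, pstdev


def to_inverse_depth(depth_m, min_depth=1e-3):
    """Convert depth to inverse depth, clamping small values to avoid division by zero."""
    return [1.0 / max(d, min_depth) for d in depth_m]


def standardize(values, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]
```

Inverse depth compresses far-away points into a small range near 0 (for example, depths from 1 m to 100 m map to values between 1 and 0.01), which keeps the target values bounded.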
<aside>
📌 SUMMARY: By doing proper data cleaning, transformation, and pre-processing, we get much better results during model training
</aside>
Date: September 12, 2025
Topic: Managing Bias
<aside>
📌 SUMMARY: We should maintain fairness
</aside>