Date: August 30, 2025

Topic: Classification

Recall

Notes

Introduction to Classification


In ML, we don’t have access to the true distribution of the data. Hence, we need human experts to label data so that we can perform supervised learning.

We first train the model and then use some withheld data to test it.
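A minimal sketch of withholding data for testing, using only the Python standard library. The `train_test_split` helper, the 20% test fraction, and the toy labeled examples are illustrative assumptions, not part of the lecture.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The first n_test shuffled examples are withheld; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical expert-labeled examples: (text, label)
labeled = [
    ("the acting was great", "pos"),
    ("the plot was not good", "neg"),
    ("a wonderful film", "pos"),
    ("boring and slow", "neg"),
    ("great plot", "pos"),
]

train_set, test_set = train_test_split(labeled)
```

The model would be fit on `train_set` only, and `test_set` would be used once afterwards to estimate how well it generalizes.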

Features of Classification

Probabilities From Our Data

Performing Classification


Text is challenging for supervised learning because it is unstructured. This can be mitigated through feature engineering, where we extract insights from the text and turn them into structured pieces of information.

Problems with Classifying Text


In bag-of-words, we disregard word order and can look at either word presence or word frequency to build this bag. Other definitions are possible as well.

We can use unigrams, bigrams or n-grams to construct this bag.
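To make the unigram/bigram/n-gram distinction concrete, here is a small sketch. The `ngrams` helper and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Simplistic whitespace tokenization, assumed for this example only
tokens = "the acting was great".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # adjacent word pairs
trigrams = ngrams(tokens, 3)  # adjacent word triples
```

Larger values of `n` keep some local word order inside each feature, at the cost of a larger and sparser feature set.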

Bag-of-Words

E.g., $x$ = “The acting was great, but the plot wasn’t so good”

By Word Presence

By Frequency

It is also possible to define bag-of-words in other ways.
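As a rough illustration, the presence and frequency definitions can be computed for the example sentence $x$ as follows. The `tokenize` helper (lowercasing, keeping only letters and apostrophes) is an assumed preprocessing choice, and the sentence is typed with a straight apostrophe for simplicity.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

x = "The acting was great, but the plot wasn't so good"
tokens = tokenize(x)

# By frequency: each word maps to how many times it occurs in x
bow_frequency = Counter(tokens)

# By word presence: each word that occurs maps to 1, regardless of count
bow_presence = {word: 1 for word in bow_frequency}
```

Here "the" occurs twice, so it has a count of 2 in `bow_frequency` but a value of 1 in `bow_presence`. In both cases the word order of $x$ is discarded.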



<aside> 📌 SUMMARY: Classification in NLP can refer to many different tasks, like finding topics, finding sentiments, text generation, etc. We often don’t know the true probability distribution of the input and output data, but we have examples that we can use to learn it. Text is often unstructured, so we need to perform feature engineering (like using bag-of-words) to come up with structures that ML algorithms can learn from. In the feature space, we expect examples that have similar feature values to be close to each other.

</aside>


Date: August 30, 2025

Topic: Bayesian Classification