Date: August 30, 2025
Topic: Classification
Recall
Notes
Introduction to Classification
- Many problems in NLP can be seen as classification problems
- Document topics (e.g., news, politics, etc)
- Sentiment classification (positive or negative review)
- Spam detection, etc
In ML, we rarely have access to the true distribution of the data. Instead, human experts label examples so that we can perform supervised learning.
We first train the model and then use some withheld data to test it.
Features of Classification
- Input: word, sentence, paragraph, document, etc
- Output: label from a finite set of labels (e.g., topic, sentiment, etc)
- Classification is a mapping from $V^* \rightarrow L$ (word sequences to labels)
- $V$ is the input vocab: a set of words
- $V^*$ is the set of all possible sequences of words (thus $V^* \supset V$)
- $X$ is a random var. of inputs, such that each value of $X$ is from $V^*$
- $X$ can take on the value of all possible text sequences
- $Y$ is a random var. of outputs taken from $l \in L$
Probabilities From Our Data
- $P(X,Y)$ is the distribution of labeled texts
- The joint probability with all possible text documents and all possible labels
- $P(Y)$ is the distribution of labels
- Irrespective of documents, how frequently we would see each label
- E.g., in movie reviews, we generally see more negative than positive reviews
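Since the true distributions are never observed directly, one common way to approximate them is by relative frequency in a labeled sample. The sketch below assumes a tiny, invented set of `(text, label)` pairs; the function names are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

# Hypothetical labeled examples, invented purely for illustration.
labeled = [
    ("great acting", "positive"),
    ("boring plot", "negative"),
    ("terrible pacing", "negative"),
    ("loved it", "positive"),
    ("not good", "negative"),
]

def empirical_label_distribution(pairs):
    """Estimate P(Y) as the relative frequency of each label in the sample."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def empirical_joint(pairs):
    """Estimate P(X, Y) as the relative frequency of each (text, label) pair."""
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: n / total for pair, n in counts.items()}

p_y = empirical_label_distribution(labeled)
p_xy = empirical_joint(labeled)
```

Summing the estimated joint over all texts that share a label gives back the estimated $P(Y)$ for that label, which is the usual marginalization relationship between the two distributions.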
Performing Classification
- Problem: We don’t know $P(X,Y)$ or $P(Y)$ except through data
- Human experts need to label the data (supervised learning)
- Training: Feed data to a supervised ML algorithm approximating the function $\text{classify}:V^* \rightarrow L$
- Testing: Apply the learned model $\text{classify}$ to some proportion of withheld data
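The train-then-test procedure can be sketched as follows. This is a minimal illustration, not a recommended setup: the data is synthetic, the 20% withheld proportion is an arbitrary assumption, and the "learning algorithm" is a stand-in majority-label baseline chosen only because it fits in a few lines.

```python
import random
from collections import Counter

def train_test_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle labeled (text, label) pairs and withhold a fraction for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_classifier(train_pairs):
    """Stand-in learner: always predict the most common training label."""
    majority = Counter(label for _, label in train_pairs).most_common(1)[0][0]
    return lambda text: majority

def accuracy(classify, test_pairs):
    """Fraction of withheld examples whose predicted label matches the true label."""
    correct = sum(classify(text) == label for text, label in test_pairs)
    return correct / len(test_pairs)

# Synthetic data: 10 "positive" and 20 "negative" examples.
data = [(f"review {i}", "negative" if i % 3 else "positive") for i in range(30)]

train, test = train_test_split(data)
classify = train_majority_classifier(train)  # approximates classify: V* -> L
acc = accuracy(classify, test)               # evaluated only on withheld data
```

The key point is that `test` is never seen during training, so `acc` estimates how the learned function behaves on new text.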
Text is challenging for supervised learning because it is unstructured. This can be mitigated through feature engineering, where we gain insights into the data and create structured pieces of info.
Problems with Classifying Text
- $V^*$ is generally considered unstructured (it can be any text)
- Supervised learning likes structured inputs (features)
- We want to take unstructured sequences of text and break them down into structured pieces of info
- Features are what the algo is allowed to see and know about
- Generally a simplification of the overall set of data
- In a perfect world, we throw away all unnecessary parts of input and keep the useful stuff
- Realistically, we don’t know what is useful
- But we can gain some insight through feature engineering
In bag-of-words, we disregard word order and can look at either word presence or word frequency to build this bag. Other definitions are possible as well.
We can use unigrams, bigrams or n-grams to construct this bag.
Bag-of-Words
E.g., $x$ = “The acting was great, but the plot wasn’t so good”
- Text is sequential, so the order of words should matter, but bag-of-words disregards it
By Word Presence
- Features are word presence
- Possible set of features $\phi =$ {a, the, acting, great, good, plot, not, was, but, horrible, …}
- Unigrams: Features that consist of only a single word
- With bag-of-words, any document $x$ is now represented as a $d$-dimensional vector of features, where $d$ is the number of features in $\phi$
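As a rough sketch of word-presence features, the example sentence can be turned into a binary vector over the listed unigrams. The tokenizer here is a simplifying assumption (lowercasing and a small regex), and `presence_vector` is a hypothetical helper name.

```python
import re

# Unigram features listed in the notes (the "…" in the notes is left out).
VOCAB = ["a", "the", "acting", "great", "good", "plot", "not", "was", "but", "horrible"]

def tokenize(text):
    """Lowercase, normalize curly apostrophes, and extract word tokens."""
    text = text.lower().replace("’", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)

def presence_vector(text, vocab=VOCAB):
    """Return a d-dimensional 0/1 vector: 1 if the feature word occurs in the text."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]

x = "The acting was great, but the plot wasn’t so good"
phi_x = presence_vector(x)
# phi_x == [0, 1, 1, 1, 1, 1, 0, 1, 1, 0]
```

Under this tokenizer "wasn't" stays a single token, so it activates neither `was` nor `not`; the `was` feature is set only because "was" also appears on its own. A different tokenizer would give a different vector, which is one reason feature design matters.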
By Frequency
- Instead of just whether the word exists or not, we can look at its frequency
Bag-of-words can be defined in other ways as well
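A frequency-based bag can be sketched by counting occurrences instead of recording presence, and the same helper extends to bigrams. The token list is assumed to be pre-tokenized from the example sentence, and the bigram feature list is hypothetical, chosen only for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vector(tokens, vocab, n=1):
    """Return how many times each n-gram feature in vocab occurs in tokens."""
    counts = Counter(ngrams(tokens, n))
    return [counts[feature] for feature in vocab]  # missing features count as 0

# Assumed tokenization of the example sentence.
tokens = ["the", "acting", "was", "great", "but", "the", "plot", "wasn't", "so", "good"]

unigram_vocab = ["a", "the", "acting", "great", "good", "plot", "not", "was", "but", "horrible"]
bigram_vocab = ["the acting", "the plot", "so good", "not good"]  # hypothetical features

unigram_counts = count_vector(tokens, unigram_vocab)
bigram_counts = count_vector(tokens, bigram_vocab, n=2)
# unigram_counts == [0, 2, 1, 1, 1, 1, 0, 1, 1, 0]
# bigram_counts == [1, 1, 1, 0]
```

The only difference from the presence version on this sentence is the count of 2 for "the". Bigrams recover a little local word order that unigrams discard.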
<aside>
📌 SUMMARY:
Classification in NLP can refer to many different tasks like finding topics, finding sentiments, spam detection, etc
We often don’t know the true probability distributions of the input and output data, but we have labeled examples that we can use to approximate them.
Text is often unstructured so we need to perform feature engineering (like using bag-of-words) to come up with structures that ML algorithms can learn from
In the feature space, we expect examples that have similar values in their features to be close to each other
</aside>
Date: August 30, 2025
Topic: Bayesian Classification