Introduction to Classification

Many problems in NLP can be seen as classification problems:
- Document topics (e.g., news, politics)
- Sentiment classification (positive or negative review)
- Spam detection, etc.

In ML, we don’t have direct access to the true distributions we care about; we only see them through data. Hence, we need human experts to label data so that we can perform supervised learning.

We first train the model and then use some withheld data to test it.

Features of Classification

Input: word, sentence, paragraph, document, etc
Output: label from a finite set of labels (e.g., topic, sentiment, etc)
Classification is a mapping $V^* \rightarrow L$ (word sequences to labels)
- $V$ is the input vocab — a set of words
- $V^*$ is the set of all possible sequences of words (thus $V \subset V^*$, treating each word as a length-one sequence)
$X$ is a random variable over inputs, such that each value of $X$ is from $V^*$
- $X$ can take on the value of any possible text sequence
$Y$ is a random variable over outputs, where each value is a label $l \in L$

Probabilities From Our Data

$P(X,Y)$ is the distribution of labeled texts
- The joint probability with all possible text documents and all possible labels
$P(Y)$ is the distribution of labels
- Irrespective of documents, how frequently we would see each label
- E.g., in movie reviews, generally see more negative than positive reviews
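
As a minimal sketch of how $P(Y)$ might be estimated from labeled data, the snippet below counts how often each label occurs and normalizes by the total. The function name `estimate_label_distribution` and the toy labels are hypothetical, invented here for illustration only.

```python
from collections import Counter

def estimate_label_distribution(labels):
    """Return the relative frequency of each label, an empirical estimate of P(Y)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels (not real data).
labels = ["neg", "pos", "neg", "neg"]
p_y = estimate_label_distribution(labels)
print(p_y)
```

With more labeled examples, these relative frequencies should become a better approximation of the true label distribution.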

Performing Classification

Problem: We don’t know $P(X,Y)$ or $P(Y)$ except through data
Human experts need to label the data (supervised learning)
Training: Feed data to a supervised ML algorithm approximating the function $\text{classify}:V^* \rightarrow L$
Testing: Apply the learned model $\text{classify}$ to some proportion of withheld data
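
The train-then-test workflow could be sketched as follows. This is only an illustration under stated assumptions: the toy dataset is invented, and the "model" is a placeholder majority-class baseline standing in for a real supervised learning algorithm. All function names are hypothetical.

```python
import random
from collections import Counter

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the labeled examples and withhold a fraction of them for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_classifier(train):
    """Placeholder learner: always predicts the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: majority

def accuracy(classify, test):
    """Fraction of withheld examples whose predicted label matches the human label."""
    correct = sum(classify(text) == label for text, label in test)
    return correct / len(test)

# Hypothetical labeled data (text, label), invented for illustration.
data = [
    ("great film", "pos"),
    ("awful plot", "neg"),
    ("boring", "neg"),
    ("loved it", "pos"),
    ("terrible acting", "neg"),
]
train, test = train_test_split(data)
classify = train_majority_classifier(train)
acc = accuracy(classify, test)
print(f"held-out accuracy: {acc:.2f}")
```

The key design point is that `test` is never seen during training, so the accuracy reflects how the learned `classify` function behaves on new inputs.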

Text is challenging to perform supervised learning on since it is unstructured. This can be mitigated through feature engineering, where we gain insights into the data and create structured pieces of info.

Problems with Classifying Text

$V^*$ is generally considered unstructured (it can be any text)
Supervised learning likes structured inputs (features)
- We want to take unstructured sequences of text and break them down into structured pieces of info
Features are what the algo is allowed to see and know about
- Generally a simplification of the overall set of data
In a perfect world, we throw away all unnecessary parts of input and keep the useful stuff
- Realistically, we don’t know what is useful
- But we can gain some insight through feature engineering

In bag-of-words, we disregard word order and can look at either word presence or word frequency to build this bag. Other definitions are possible as well.

We can use unigrams, bigrams or, more generally, n-grams to construct this bag.
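
A minimal sketch of n-gram extraction, assuming the text has already been split into tokens on whitespace. The helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the acting was great".split()
unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # adjacent word pairs
print(unigrams)
print(bigrams)
```

Bigrams and longer n-grams retain a little local word order, at the cost of a larger feature set.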

Bag-of-Words

E.g., $x$ = “The acting was great, but the plot wasn’t so good”

Text is sequential, so the order of words should matter, but bag-of-words discards that order

By Word Presence

Features are word presence
Possible set of features $\phi =$ {a, the, acting, great, good, plot, not, was, but, horrible, …}
Unigrams: Features that consist of only a single word
With bag-of-words, any document $x$ is now represented as a $d$-dimensional vector of features, where $d$ is the number of features in $\phi$
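
The word-presence representation could be sketched like this. The vocabulary is copied from the example feature set above (without the elided "…" entries). The tokenizer is an assumption: it lowercases the text and keeps runs of letters and apostrophes. The names `tokenize` and `presence_vector` are hypothetical.

```python
import re

# Vocabulary taken from the example feature set; the elided entries are omitted.
VOCAB = ["a", "the", "acting", "great", "good", "plot", "not", "was", "but", "horrible"]

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def presence_vector(text, vocab=VOCAB):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text, else 0."""
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in vocab]

x = "The acting was great, but the plot wasn't so good"
vec = presence_vector(x)
print(vec)  # [0, 1, 1, 1, 1, 1, 0, 1, 1, 0]
```

Note that "not" is 0 here because this simple tokenizer keeps "wasn't" as one token. Splitting contractions would be a separate feature-engineering choice.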

By Frequency

Instead of recording only whether a word occurs, we can record how many times it occurs
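
A frequency-based variant could look like the sketch below. It makes the same assumptions as a simple presence-based version: the vocabulary is the example feature set without its elided entries, the tokenizer is a simplified regex, and the names `tokenize` and `count_vector` are hypothetical.

```python
import re
from collections import Counter

# Vocabulary taken from the example feature set; the elided entries are omitted.
VOCAB = ["a", "the", "acting", "great", "good", "plot", "not", "was", "but", "horrible"]

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def count_vector(text, vocab=VOCAB):
    """Return a vector of how many times each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]  # Counter returns 0 for unseen words

x = "The acting was great, but the plot wasn't so good"
freq = count_vector(x)
print(freq)  # [0, 2, 1, 1, 1, 1, 0, 1, 1, 0]
```

The only difference from word presence in this example is the entry for "the", which occurs twice.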

It is possible to define bag-of-words in other ways as well

<aside> 📌 SUMMARY: Classification in NLP covers many different tasks, such as identifying topics, classifying sentiment, and detecting spam. We often don’t know the true joint distribution of inputs and labels, but we have labeled examples from which a model can be learned. Text is often unstructured, so we need feature engineering (such as bag-of-words) to produce structured inputs that ML algorithms can learn from. In the feature space, we expect examples with similar feature values to be close to each other.

</aside>

Date: August 30, 2025

Topic: Classification

Recall

Notes