Why Language Models?

Capable of: autocomplete, summarization, translation, chatbots, etc.
Can even be multimodal: speech & images
- Use fluency to transcribe speech or to caption images

What Makes a Good Language Model?

Model fluency of a sequence of words rather than every aspect of language production
Fluent: Looks like accurate language
- Does a sequence of words look like fluent language?
- How do we choose words so that the sequence looks fluent?

In our vocabulary, we might keep only the most frequently used words, and use a symbol like OOV to indicate that a word we see is unknown.

Using random variables, we treat fluency as probability: the more probable a sequence of words $W_1, W_2, \ldots, W_n$ is, the more fluent we take it to be. Producing fluent text then means finding a sequence with high probability.

Modeling Fluency

Vocabulary

A set of words that our system knows ($V$)
How many words does a system need to know? (e.g., English has >600,000 words)
- May want to limit to just top 50,000 most frequent words
Special symbols
- OOV: Out-of-vocabulary words
- SOS: Start of sequence
- EOS: End of sequence
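
A minimal sketch of the vocabulary idea above, assuming whitespace tokenization and a toy corpus. The function names `build_vocab` and `encode` and the bracketed spellings of the special symbols are hypothetical choices made for illustration, not part of the notes.

```python
from collections import Counter

# Hypothetical spellings of the special symbols from the list above.
OOV, SOS, EOS = "<OOV>", "<SOS>", "<EOS>"

def build_vocab(sentences, max_words):
    """Keep the max_words most frequent words, plus the special symbols."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    kept = [word for word, _ in counts.most_common(max_words)]
    return {OOV, SOS, EOS, *kept}

def encode(sentence, vocab):
    """Wrap a sentence in SOS/EOS and replace unknown words with OOV."""
    words = [w if w in vocab else OOV for w in sentence.split()]
    return [SOS, *words, EOS]

# Toy corpus (made up); a real system might keep the top 50,000 words instead of 3.
corpus = ["the moles snuck into the garden", "the moles dug in the garden"]
vocab = build_vocab(corpus, max_words=3)
encoded = encode("the moles ate the garden", vocab)
print(encoded)
```

Here "ate" never appears in the toy corpus, so it is mapped to the OOV symbol, while the frequent words pass through unchanged.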

Modeling Fluent Language

What is the probability that a sequence of words $w_1$, $w_2$, $w_3$, …, $w_n$ would occur in a corpus of text produced by fluent language users?
- Fluency can be approximated by probability (fluency $\approx$ probability)
- What is the probability of a sequence of words → $P(w_1, w_2, w_3,...,w_n)$?
Take a bunch of random variables $W_1, W_2, W_3, ...,W_n$ such that each variable $W_i$ can take on the value of a word in the vocab $V$
- Each $W_i$ can take any one of the (e.g., 50,000) words in $V$ as its value
- $W_1, W_2, W_3, ...,W_n$ are numbered according to word order

To model fluency, we can use the preceding words to predict the next word by choosing one that has the highest probability.
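
As a rough illustration of choosing the highest-probability next word, the sketch below picks the argmax from a next-word distribution. The probabilities are invented for the example and the helper name `most_likely_next` is hypothetical; a real model would compute the distribution from the preceding words.

```python
def most_likely_next(next_word_probs):
    """Return the word with the highest probability in the distribution."""
    return max(next_word_probs, key=next_word_probs.get)

# Made-up distribution for the word following "The moles snuck".
next_word_probs = {"into": 0.60, "out": 0.25, "garden": 0.10, "<OOV>": 0.05}
best = most_likely_next(next_word_probs)
print(best)
```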

Example of Modeling Fluency

For a given sentence, assign a random variable to each word
- E.g., “The moles snuck into the garden last night”
- With chain rule:
$$ \begin{align*}P(W_1 = w_1, W_2 = w_2, \ldots, W_n = w_n) &= P(W_1 = w_1) \cdot P(W_2 = w_2 \mid W_1 = w_1) \\&\quad \cdot P(W_3 = w_3 \mid W_1 = w_1, W_2 = w_2) \\&\quad \cdots \\&\quad \cdot P(W_n = w_n \mid W_1 = w_1, W_2 = w_2, \ldots, W_{n-1} = w_{n-1})\\ &= \prod_{t=1}^{n} P\!\left(W_t = w_t \;\middle|\; W_1 = w_1, W_2 = w_2, \ldots, W_{t-1} = w_{t-1}\right)\end{align*} $$
- $W_1$ has no parent, so its probability is unconditional; every later word is conditioned on the words before it → $W_n$ is conditioned on all the words that precede it
- Hence the probability of a sequence is the product, over every position $t$, of the probability of the word at position $t$ given all the words that precede it
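
The chain rule above can be sketched in code as follows. This is only an illustration under stated assumptions: `cond_prob` stands in for whatever model supplies $P(W_t = w_t \mid \text{history})$, and the uniform placeholder below is a made-up stand-in, not a real language model. The sketch sums log probabilities rather than multiplying raw probabilities, which gives the log of the same product and avoids numerical underflow on long sequences.

```python
import math

def sequence_log_prob(words, cond_prob):
    """Sum log P(w_t | w_1..w_{t-1}) over every position t (chain rule in log space)."""
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[:t])  # empty for the first word, which has no parent
        total += math.log(cond_prob(word, history))
    return total

def uniform_cond_prob(word, history, vocab_size=50_000):
    """Placeholder model (assumption): every word is equally likely, whatever the history."""
    return 1.0 / vocab_size

sentence = "The moles snuck into the garden last night".split()
log_p = sequence_log_prob(sentence, uniform_cond_prob)
print(log_p)
```

Swapping `uniform_cond_prob` for a function that actually uses `history` is where a real model would come in.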

History and Context

$$ P(W_1 = w_1, \ldots, W_n = w_n)= \prod_{t=1}^{n} P\!\left( W_t = w_t \,\middle|\, \underbrace{W_1 = w_1,\, W_2 = w_2,\, \ldots,\, W_{t-1} = w_{t-1}}_{\text{$w_1, w_2, \ldots, w_{t-1}$ is also called the "history" ("context") of the $t$-th word}}\right). $$

However, as sequences get longer, the contexts also get longer.
We need to squeeze the history/context into a compact representation so the algorithm can better predict the next word.
For arbitrarily long sequences, conditioning on a large (or unbounded) number of random variables is difficult to deal with.
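
One common way to shrink the history, assumed here for illustration, is to keep only the single previous word (a bigram approximation), so that $P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-1})$. The sketch below estimates these probabilities from counts on a tiny made-up corpus; the function name `train_bigram` is hypothetical and no smoothing is applied.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a count-based estimate of P(cur | prev) from tokenized sentences."""
    pair_counts = Counter()
    prev_counts = Counter()
    for words in sentences:
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1

    def prob(cur, prev):
        # Unseen history: no estimate is available, so return 0.0 (no smoothing).
        if prev_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, cur)] / prev_counts[prev]

    return prob

# Toy corpus (made up), already wrapped in start/end symbols.
corpus = [
    ["<SOS>", "the", "moles", "snuck", "<EOS>"],
    ["<SOS>", "the", "garden", "<EOS>"],
]
prob = train_bigram(corpus)
print(prob("moles", "the"))
```

The context is now a single word regardless of sentence length, at the cost of ignoring everything earlier in the sequence.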

<aside> 📌 SUMMARY: The main goal of language generation is to model fluency, so that we can produce text that looks like accurate language. Fluency can be approximated using probability. Unigrams, bigrams, and n-grams can reasonably approximate fluency, but large values of $n$ are hard to manage.

</aside>

Date: September 2, 2025

Topic: Language Modeling

Recall

Notes