Date: September 2, 2025
Topic: Language Modeling
Recall
Language models are capable of a wide variety of tasks
Fluency is important for language models
Notes
Why Language Models?
- Capable of: autocomplete, summarization, translation, chatbots, etc.
- Can even be multimodal: speech & images
- Use fluency to transcribe speech or to caption images
What Makes a Good Language Model?
- Model fluency of a sequence of words rather than every aspect of language production
- Fluent: Looks like accurate language
- Does a sequence of words look like fluent language?
- How do we choose words so that the sequence looks fluent?
In our vocabulary, we might keep only the most frequently used words and use a special symbol such as OOV to mark any word that is not in the vocabulary.
Treating each word position as a random variable, we approximate fluency by probability: we want an assignment of words $w_1, w_2, \ldots, w_n$ to the variables $W_1, W_2, \ldots, W_n$ that has high probability.
Modeling Fluency
Vocabulary
- A set of words that our system knows ($V$)
- How many words does a system need to know? (e.g., English has >600,000 words)
- May want to limit to just top 50,000 most frequent words
- Special symbols
- OOV: Out-of-vocabulary words
- SOS: Start of sequence
- EOS: End of sequence
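The vocabulary idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the symbol spellings `<OOV>`, `<SOS>`, `<EOS>`, the helper names `build_vocab` and `prepare`, and the tiny corpus are all made up for the example, and a cutoff of 2 words stands in for the 50,000 mentioned above.

```python
from collections import Counter

# Assumed spellings for the special symbols; the notes only name them OOV, SOS, EOS.
OOV, SOS, EOS = "<OOV>", "<SOS>", "<EOS>"

def build_vocab(sentences, max_words):
    """Keep the max_words most frequent words, plus the special symbols."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    most_frequent = [word for word, _ in counts.most_common(max_words)]
    return {OOV, SOS, EOS} | set(most_frequent)

def prepare(sentence, vocab):
    """Wrap a sentence in SOS/EOS and replace unknown words with OOV."""
    words = [word if word in vocab else OOV for word in sentence.split()]
    return [SOS] + words + [EOS]

# Toy corpus (made up): "the" occurs 3 times, "garden" twice, everything else once.
corpus = ["the moles snuck into the garden", "the garden was quiet last night"]
vocab = build_vocab(corpus, max_words=2)
prepared = prepare("the moles left the garden", vocab)
print(prepared)
```

With this cutoff only "the" and "garden" survive, so "moles" and "left" are both mapped to the OOV symbol.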
Modeling Fluent Language
- What is the probability that a sequence of words $w_1$, $w_2$, $w_3$, …, $w_n$ would occur in a corpus of text produced by fluent language users?
- Fluency can be approximated by probability (fluency $\approx$ probability)
- What is the probability of a sequence of words → $P(w_1, w_2, w_3,...,w_n)$?
- Take a bunch of random variables $W_1, W_2, W_3, ...,W_n$ such that each variable $W_i$ can take on the value of a word in the vocab $V$
- Each $W_i$ can take any one of the 50,000 words we might find in $V$
- $W_1, W_2, W_3, ...,W_n$ are numbered according to word order
To model fluency, we can use the preceding words to predict the next word by choosing one that has the highest probability.
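As a rough illustration of choosing the highest-probability next word: the sketch below assumes we already have a distribution over next words given some history. The table `next_word_probs` and its numbers are invented for the example, and `most_likely_next` is a hypothetical helper name.

```python
# Hypothetical values of P(next word | "the moles snuck into the"); the numbers are made up.
next_word_probs = {"garden": 0.6, "house": 0.3, "night": 0.1}

def most_likely_next(probs):
    """Return the word with the highest conditional probability."""
    return max(probs, key=probs.get)

best = most_likely_next(next_word_probs)
print(best)
```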
Example of Modeling Fluency
- For a given sentence, assign a random variable to each word
- E.g., “The moles snuck into the garden last night”
- With chain rule:
$$
\begin{align*}P(W_1 = w_1, W_2 = w_2, \ldots, W_n = w_n) &= P(W_1 = w_1) \cdot P(W_2 = w_2 \mid W_1 = w_1) \\&\quad \cdot P(W_3 = w_3 \mid W_1 = w_1, W_2 = w_2) \\&\quad \cdots \\&\quad \cdot P(W_n = w_n \mid W_1 = w_1, W_2 = w_2, \ldots, W_{n-1} = w_{n-1})\\ &= \prod_{t=1}^{n} P\!\left(W_t = w_t \;\middle|\; W_1 = w_1, W_2 = w_2, \ldots, W_{t-1} = w_{t-1}\right)\end{align*}
$$
- $W_1$ has no preceding words, so its probability is unconditional, but each subsequent word is conditioned on the previous words → $W_n$ is conditioned on all words that occurred before it
- Hence the probability of a sequence is the product, over every position $t$, of the probability of the word at position $t$ given all of the words that precede it
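The chain rule product can be sketched in Python as follows. This is a minimal illustration, not a real model: the conditional probabilities in `table` are invented, `cond_prob` and `sequence_log_prob` are hypothetical names, and the sum of logs is used in place of the raw product only because multiplying many small probabilities tends to underflow.

```python
import math

def sequence_log_prob(words, cond_prob):
    """Sum log P(w_t | w_1, ..., w_{t-1}) over every position t."""
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[:t])  # empty for the first word
        total += math.log(cond_prob(word, history))
    return total

# Made-up conditional probabilities, indexed by the full history.
table = {
    (): {"the": 0.5},
    ("the",): {"moles": 0.2},
    ("the", "moles"): {"snuck": 0.1},
}

def cond_prob(word, history):
    """Look up P(word | history) in the toy table."""
    return table[history][word]

log_p = sequence_log_prob(["the", "moles", "snuck"], cond_prob)
print(math.exp(log_p))  # approximately 0.5 * 0.2 * 0.1
```

Note that the table needs one entry per distinct history, which is exactly what becomes unmanageable as sequences grow.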
History and Context
$$
P(W_1 = w_1, \ldots, W_n = w_n)= \prod_{t=1}^{n} P\!\left( W_t = w_t \,\middle|\, \underbrace{W_1 = w_1,\, W_2 = w_2,\, \ldots,\, W_{t-1} = w_{t-1}}_{\text{$w_1, w_2, \ldots, w_{t-1}$ is also called the “history” (“context”) of the $t$-th word}}\right).
$$
- However, as sequences get longer, the contexts also get longer
- Need to figure out how to squeeze the history/context into a compact representation so the algorithm can better predict the $(t+1)$-th word
- For arbitrarily long sequences, conditioning on a large (or unbounded) number of random variables is difficult to deal with
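One simple way to squeeze the history, sketched here under assumptions, is to keep only the last $n-1$ words and estimate probabilities from counts, which is the n-gram idea. The helper names, the toy corpus, and the use of plain relative frequencies (with no smoothing) are choices made for this example only.

```python
from collections import Counter

def truncate_history(history, n):
    """Keep only the last n-1 words of the history (empty for n = 1)."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

def train_ngram_counts(sentences, n):
    """Count how often each truncated context, and each (context, word) pair, occurs."""
    ngram_counts, context_counts = Counter(), Counter()
    for words in sentences:
        for t, word in enumerate(words):
            context = truncate_history(words[:t], n)
            context_counts[context] += 1
            ngram_counts[(context, word)] += 1
    return ngram_counts, context_counts

def ngram_prob(word, history, n, ngram_counts, context_counts):
    """Relative-frequency estimate of P(word | last n-1 words); 0.0 for unseen contexts."""
    context = truncate_history(history, n)
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[(context, word)] / context_counts[context]

# Toy corpus (made up), already split into words.
sentences = [["the", "moles", "snuck"], ["the", "garden", "grew"]]
ngram_counts, context_counts = train_ngram_counts(sentences, n=2)
p = ngram_prob("moles", ["into", "the"], 2, ngram_counts, context_counts)
print(p)
```

With $n = 2$ only the previous word matters, so the history "into the" is treated the same as any other history ending in "the".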
<aside>
📌 SUMMARY:
The main goal of language generation is modeling fluency, so that we can produce text that looks like accurate language. Such fluency can be approximated using probability.
Unigrams, bigrams, and n-grams can reasonably approximate fluency, but large values of $n$ are hard to manage.
</aside>
Date: September 7, 2025
Topic: Neural Language Models