Date: September 22, 2025
Topic: Introduction to Transformers
Recall
Recurrence (RNNs, LSTMs, Seq2seq) was used to handle variable-length sequences with parameter sharing and compact state, which avoids a giant position-specific network.
However, Transformers replaced recurrence with self-attention, which lets each token attend directly to the others within a finite context window and allows massive parallelism.
Notes
Previous Architectural Recaps
RNNs and LSTMs

- RNNs process one time-slice at a time, passing the hidden state to the next time slice
- LSTMs improved the encoding into the hidden state
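
The recurrence described above can be sketched in a few lines. This is a minimal illustration of a vanilla RNN step, not the lecture's code: the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh`, and `b`, the `tanh` nonlinearity, and the toy weights are all assumptions chosen for illustration. It shows the same weights being reused at every time slice, with the hidden state carried forward, so sequences of any length can be processed.

```python
import math

def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla RNN step: h_new = tanh(W_xh x + W_hh h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h[k] for k in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(xs, h0, W_xh, W_hh, b):
    """Process a sequence one time slice at a time, reusing the same weights."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)  # hidden state is passed to the next slice
        states.append(h)
    return states

# Toy, hand-picked weights (illustrative only): 2-dim input, 2-dim hidden state.
W_xh = [[0.5, 0.0], [0.0, 0.5]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

# The same parameters handle a length-1 and a length-3 sequence.
states_short = run_rnn([[1.0, 0.0]], [0.0, 0.0], W_xh, W_hh, b)
states_long = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0], W_xh, W_hh, b)
```

Note that the loop over `xs` is inherently sequential: step $t$ cannot start until step $t-1$ has produced its hidden state.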
Seq2seq

- Seq2seq collects hidden states from different time steps and lets the decoder choose which ones to use via attention
- The decoder sits side-by-side with the encoder and, in addition to the hidden state, takes its own input (the token from the previous time step)
- The input to the decoder can either come from the previous time slice's prediction (e.g., $\hat{y}_1$) or from the ground truth in a teacher-forced setting (e.g., $y_1$)
- Encoder: $x$s are inputs, $h$s are outputs
- Decoder: $y$s are inputs, $\hat{y}$s are outputs
- Loss is calculated from the $\hat{y}$s
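
The two ways of feeding the decoder can be contrasted with a toy sketch. Everything here is hypothetical: `decode`, `toy_step`, the integer "tokens", and the mismatch count used as a stand-in for a real loss are assumptions for illustration, not an actual seq2seq model. The only point is which token becomes the next decoder input.

```python
def decode(step_fn, h, start_token, targets, teacher_forcing):
    """Run a decoder for len(targets) steps.

    step_fn(prev_token, h) -> (predicted_token, new_h) is a hypothetical
    stand-in for one decoder time slice.
    """
    prev = start_token
    predictions = []
    for y_true in targets:
        y_hat, h = step_fn(prev, h)
        predictions.append(y_hat)
        # Teacher forcing feeds the ground-truth y_t; otherwise feed back y_hat_t.
        prev = y_true if teacher_forcing else y_hat
    return predictions

def toy_step(prev_token, h):
    # Toy rule: predict prev + 1, except for a deliberate mistake after token 1.
    y_hat = prev_token + 1 if prev_token != 1 else 5
    return y_hat, h

targets = [1, 2, 3]
forced = decode(toy_step, None, 0, targets, teacher_forcing=True)
free = decode(toy_step, None, 0, targets, teacher_forcing=False)

# Toy 0/1 "loss": count of predictions that differ from the targets.
forced_errors = sum(p != t for p, t in zip(forced, targets))
free_errors = sum(p != t for p, t in zip(free, targets))
```

In this toy run the teacher-forced decoder recovers after its one mistake, while the free-running decoder carries the mistake into the next step.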
Reasons for Recurrence
Reasons for Wider Networks
- Originally considered bad because they are:
  - Inflexible with regard to input and output sequence lengths
  - Computationally expensive
- However, the window length $n$ can be made very long (~1024 tokens), and a shorter input sequence can be padded with all-zero vectors in place of one-hots (this addresses the inflexibility)
- Also, modern computers are much more capable of parallelization (this addresses the computational cost)
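
The padding idea above can be sketched as follows. This is a minimal illustration under assumed names (`one_hot`, `pad_to_window`) and toy sizes (a vocabulary of 4 and a window of 5 instead of ~1024); it only shows how a short sequence is brought up to the fixed window length with all-zero rows.

```python
def one_hot(token_id, vocab_size):
    """Return a one-hot vector for token_id."""
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

def pad_to_window(token_ids, vocab_size, n):
    """Encode tokens as one-hots and pad with all-zero rows up to length n."""
    if len(token_ids) > n:
        raise ValueError("sequence longer than the window")
    rows = [one_hot(t, vocab_size) for t in token_ids]
    rows += [[0] * vocab_size for _ in range(n - len(token_ids))]
    return rows

# Toy sizes for illustration only: 3 tokens padded to a window of 5.
window = pad_to_window([2, 0, 3], vocab_size=4, n=5)
```

Every input now has the same shape (`n` rows of `vocab_size` entries), which is what a wide, fixed-size network needs.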
<aside>
SUMMARY: Recurrence handled variable-length sequences via a compact shared state, but Transformers largely replaced it by using self-attention with embeddings and massive parallelism
</aside>
Date: September 22, 2025
Topic: Transformers Concepts
Recall
Unlike recurrent models, transformers have a large input window, so we don't need to process the sequence token-by-token.
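Masking is important for transformers: hiding some positions from the model supports tasks like infilling (predicting hidden tokens in the middle) and continuation (predicting what comes next).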
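
For the continuation case, one common form of masking is a causal mask, sketched below. This is an assumption about which mask is meant, not something the notes specify; `causal_mask` and `masked_softmax` are hypothetical helper names, and the uniform scores are toy values. The sketch shows that a masked position receives exactly zero attention weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, giving zero weight to masked positions."""
    allowed = [s for s, keep in zip(scores, mask_row) if keep]
    m = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Position 1 may see positions 0 and 1, but not the future position 2.
weights = masked_softmax([1.0, 1.0, 1.0], mask[1])
```

An infilling-style mask would instead hide selected positions in the middle of the window; the same `masked_softmax` helper would apply with a different `mask_row`.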
<aside>
SUMMARY: Transformers use the concept of large input windows, masking and self-attention
</aside>
Date: September 22, 2025
Topic: Transformer Encoder