Date: September 22, 2025

Topic: Introduction to Transformers

Recall

Recurrence (RNNs, LSTMs, Seq2seq) was used to handle variable-length sequences with parameter sharing and compact state, which avoids a giant position-specific network.

However, Transformers replace recurrence with self-attention: each token attends directly to the others within a finite context window, and the whole window can be processed in parallel.

Notes

Previous Architectural Recaps

RNNs and LSTMs

RNNs process one time slice at a time, passing the hidden state on to the next time slice
LSTMs improve how information is encoded into the hidden state, using gates to control what is kept and what is overwritten
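
As a rough illustration of the recurrence described above, here is a minimal sketch of a plain RNN (not an LSTM) in pure Python. The weight values, dimensions, and function names (`rnn_step`, `run_rnn`) are assumptions chosen for the example, not taken from the lecture.

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    """One time slice: combine the current input with the previous hidden state."""
    new_h = []
    for j in range(len(h)):
        total = b[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h[k] for k in range(len(h)))
        new_h.append(math.tanh(total))
    return new_h

def run_rnn(inputs, w_xh, w_hh, b):
    """Apply the same weights at every time slice, whatever the sequence length."""
    h = [0.0] * len(b)  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states

# Toy weights (assumed values): 2-dimensional inputs, 3-dimensional hidden state.
W_XH = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_HH = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
B = [0.0, 0.1, -0.1]

# The same parameters handle a 2-step and a 5-step sequence.
short_states = run_rnn([[1.0, 0.0], [0.0, 1.0]], W_XH, W_HH, B)
long_states = run_rnn([[1.0, 0.0]] * 5, W_XH, W_HH, B)
```

The loop is inherently sequential: each hidden state depends on the previous one, which is the property that limits parallelism in recurrent models.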

Seq2seq

Seq2seq collects the encoder's hidden states from every time step and uses attention to let the decoder choose which ones to use
The decoder sits side by side with the encoder and, in addition to the hidden state, takes its own input (the token from the previous time step)
That decoder input can either be the decoder's own previous prediction (e.g., $\hat{y}_1$) or, under teacher forcing, the ground-truth token (e.g., $y_1$)
Encoder: $x$s are inputs, $h$s are outputs
Decoder: $y$s are inputs, $\hat{y}$s are outputs
- Loss is calculated from the $\hat{y}$s
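
The two ways of feeding the decoder can be sketched as below. `toy_decoder_step` is a hypothetical stand-in for a trained decoder (it just does modular arithmetic on integer token IDs), and the token values are made up; the only point is which token becomes the next decoder input.

```python
def toy_decoder_step(prev_token, hidden):
    """Hypothetical decoder step: returns (predicted_token, new_hidden)."""
    new_hidden = (hidden + prev_token) % 7
    predicted = (new_hidden + 1) % 7
    return predicted, new_hidden

def decode(targets, start_token, initial_hidden, teacher_forcing):
    """Run the decoder for len(targets) steps and record what it was fed."""
    inputs_used, predictions = [], []
    prev, hidden = start_token, initial_hidden
    for t in range(len(targets)):
        inputs_used.append(prev)
        y_hat, hidden = toy_decoder_step(prev, hidden)
        predictions.append(y_hat)
        # Teacher forcing feeds the ground-truth y_t; otherwise feed y_hat_t back in.
        prev = targets[t] if teacher_forcing else y_hat
    return inputs_used, predictions

targets = [3, 1, 4]  # ground-truth tokens y_1, y_2, y_3 (made-up IDs)
tf_inputs, tf_preds = decode(targets, start_token=0, initial_hidden=2, teacher_forcing=True)
fr_inputs, fr_preds = decode(targets, start_token=0, initial_hidden=2, teacher_forcing=False)
# tf_inputs is [0, 3, 1]: the start token followed by the targets shifted right.
# fr_inputs is [0, 3, 6]: the start token followed by the decoder's own predictions.
```

In both modes the loss would compare the predictions against `targets`; only the decoder's inputs differ.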

Reasons for Recurrence

RNNs, LSTMs and Seq2seq exist because recurrence avoids building one wide neural network that takes multiple tokens as inputs and outputs at once
- Such a wide network would be very large, as each input $x_1, x_2,$ etc. is one-hot encoded
- The corresponding $y$ probability vectors would also be very large
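
To get a feel for the sizes involved, here is a back-of-the-envelope sketch. The window length of 1024 tokens and the vocabulary size of 50,000 are example values assumed for illustration, and the single dense layer is a deliberately naive design, not an architecture from the lecture.

```python
def wide_network_sizes(window_len, vocab_size):
    """Sizes for a network that takes a whole window of one-hot tokens at once."""
    input_width = window_len * vocab_size   # one one-hot vector per input position
    output_width = window_len * vocab_size  # one probability vector per output position
    dense_weights = input_width * output_width  # a single fully connected layer
    return input_width, output_width, dense_weights

input_width, output_width, dense_weights = wide_network_sizes(window_len=1024, vocab_size=50_000)
# input_width  -> 51,200,000 input units
# dense_weights -> 2,621,440,000,000,000 weights for one naive dense layer
```

A recurrent model, by contrast, only ever sees one token's worth of input per time slice and reuses the same weights at every step.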

Reasons for Wider Networks

Wide networks were originally considered bad because they are:
1. Inflexible with regard to input and output sequence lengths
2. Computationally expensive
However, $n$ can be made very long (~1024 tokens), and shorter input sequences can simply be padded with all-zero vectors in place of the missing one-hots (addresses no. 1)
Also, modern hardware is much more capable of parallelization (addresses no. 2)
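
The padding idea can be sketched as follows, assuming integer token IDs and a tiny vocabulary; the helper names `one_hot` and `pad_to_window` are made up for this example.

```python
def one_hot(token_id, vocab_size):
    """One-hot vector with a 1 at the token's index."""
    vec = [0] * vocab_size
    vec[token_id] = 1
    return vec

def pad_to_window(token_ids, window_len, vocab_size):
    """Encode tokens as one-hots and fill the unused positions with all-zero rows."""
    if len(token_ids) > window_len:
        raise ValueError("sequence is longer than the input window")
    rows = [one_hot(t, vocab_size) for t in token_ids]
    rows += [[0] * vocab_size for _ in range(window_len - len(token_ids))]
    return rows

# A 3-token sequence placed into a 5-position window over a 4-word vocabulary.
window = pad_to_window([2, 0, 3], window_len=5, vocab_size=4)
# window[0] is [0, 0, 1, 0]; window[3] and window[4] are all zeros (padding).
```

Every input now has the same shape regardless of its true length, which is what lets a fixed-width network accept it.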

<aside> 📌 SUMMARY: Recurrence handled variable-length sequences via a compact shared state, but Transformers largely replaced it by using self-attention with embeddings and massive parallelism

</aside>

Date: September 22, 2025

Topic: Transformers Concepts

Recall

Unlike recurrent models, transformers have a large input window, so the sequence does not have to be processed token by token.

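Masking is important for transformers: hiding some tokens from the model sets up tasks such as infilling (predicting hidden tokens inside the sequence) and continuation (predicting the tokens that come next).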
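
One common reading of these two masking styles is sketched below. This is an assumption about what the notes intend: continuation is modelled with a causal mask over attention weights, and infilling by replacing tokens with a placeholder. The `"[MASK]"` string and all function names are hypothetical.

```python
import math

def causal_mask(n):
    """Continuation: position i may only look at positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, keep):
    """Softmax over scores, giving exactly zero weight to hidden positions."""
    kept = [s for s, k in zip(scores, keep) if k]
    if not kept:
        raise ValueError("at least one position must be visible")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if k else 0.0 for s, k in zip(scores, keep)]
    total = sum(exps)
    return [e / total for e in exps]

def infill_inputs(tokens, hidden_positions, mask_token="[MASK]"):
    """Infilling: hide chosen tokens so the model has to predict them."""
    return [mask_token if i in hidden_positions else tok for i, tok in enumerate(tokens)]

mask = causal_mask(3)
# Attention weights for the second token, with equal raw scores everywhere.
weights = masked_softmax([1.0, 1.0, 1.0], mask[1])
# weights is [0.5, 0.5, 0.0]: the future token gets no attention.

masked = infill_inputs(["the", "cat", "sat", "down"], {2})
# masked is ["the", "cat", "[MASK]", "down"]
```

Because the mask is applied to all positions at once, every row of attention weights can be computed in parallel rather than one time slice at a time.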

<aside> 📌 SUMMARY: Transformers use the concept of large input windows, masking and self-attention

</aside>

Date: September 22, 2025

Topic: Transformer Encoder