Date: September 22, 2025

Topic: Introduction to Transformers

Recall

Recurrence (RNNs, LSTMs, Seq2seq) was used to handle variable-length sequences with parameter sharing and compact state, which avoids a giant position-specific network.
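
A minimal sketch of the parameter sharing and compact state described above. Everything here is an illustrative assumption rather than something from the lecture: the scalar state, the weight values, and the names `rnn_step` and `run_rnn`. The point is only that the same weights are reused at every position and the state keeps a fixed size for any sequence length.

```python
import math

def rnn_step(h, x, w_h, w_x, b):
    """One recurrent update: h_new = tanh(w_h * h + w_x * x + b)."""
    return math.tanh(w_h * h + w_x * x + b)

def run_rnn(xs, w_h=0.5, w_x=1.0, b=0.0):
    """Fold a sequence of any length into one fixed-size state."""
    h = 0.0  # compact state: its size does not depend on len(xs)
    for x in xs:
        # The same parameters (w_h, w_x, b) are reused at every position.
        h = rnn_step(h, x, w_h, w_x, b)
    return h

# Illustrative inputs (assumed values): two sequences of different lengths.
short_state = run_rnn([0.1, 0.2])
long_state = run_rnn([0.1, 0.2, 0.3, 0.4, 0.5])
```

Both calls use the same three parameters, and each step depends on the previous state, which is why the loop has to run sequentially.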

Transformers instead use self-attention and massive parallelism, so each token can attend directly to the other tokens within a finite context window.
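
A rough sketch of self-attention, under simplifying assumptions: the queries, keys, and values are the token vectors themselves (no learned projections), there is a single head, and the toy `tokens` values are made up for illustration. It is meant to show tokens attending directly to one another, not a full Transformer layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (assumption)."""
    d = len(tokens[0])
    outputs, weights = [], []
    # Each query is handled independently of the others, so this loop
    # could run in parallel across positions.
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights

# Toy context window of three 2-dimensional token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Row `i` of `weights` says how much token `i` attends to every token in the window. No step depends on the result of a previous position, unlike a recurrent update.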

Notes

Previous Architectural Recaps

RNNs and LSTMs


Seq2seq


Reasons for Recurrence

Reasons for Wider Networks


<aside> 📌 SUMMARY: Recurrence handled variable-length sequences via a compact shared state, but Transformers largely replaced it by using self-attention with embeddings and massive parallelism.

</aside>


Date: September 22, 2025

Topic: Transformers Concepts

Recall

Unlike recurrent models, transformers take a large input window and process all of its tokens in parallel, so we don't need to step through the input token by token.

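Masking is important for transformers: hiding parts of the sequence supports tasks like infilling (predicting hidden tokens inside the window) and continuation (predicting the tokens that come next).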
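
A small sketch of two common kinds of masking, under assumptions not stated in the notes: a causal attention mask for continuation (position `i` may only attend to positions `j <= i`), and replacing chosen tokens with a placeholder for infilling. The helper names and the `"[MASK]"` placeholder string are illustrative choices.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; disallowed positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    m = max(kept)  # a causal row always allows at least the diagonal
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def hide_tokens(tokens, positions, placeholder="[MASK]"):
    """Infilling-style input: swap chosen tokens for a placeholder to predict."""
    return [placeholder if i in positions else t for i, t in enumerate(tokens)]

# Continuation: with equal scores, each position spreads its attention
# evenly over itself and earlier positions, never over later ones.
mask = causal_mask(3)
weights = [masked_softmax([0.0, 0.0, 0.0], row) for row in mask]

# Infilling: hide the middle token of a toy sentence (assumed example).
infill_input = hide_tokens(["the", "cat", "sat"], {1})
```

In the continuation case the mask is applied to attention scores, while in the infilling case the hiding happens on the input tokens themselves.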


<aside> 📌 SUMMARY: Transformers rely on large input windows, masking, and self-attention.

</aside>


Date: September 22, 2025

Topic: Transformer Encoder