Date: October 11, 2025

Topic: Soft-max and Preview of Attention

Recall

Soft-max allows us to do random sampling from the input set, with probabilities based on the values of the input set.

Notes

Soft-max

Attention: Weighting or probability distribution over inputs depending on computational state and inputs
- Allows information to propagate directly between “distant” computational nodes with minimal structural assumptions
Standard forms of attention are implemented with soft-max

Review of Soft-max

$$\operatorname{softmax}\!\left(\{x_1,\ldots,x_N\}\right) = \left\{\frac{e^{x_1}}{Z},\,\ldots,\,\frac{e^{x_N}}{Z}\right\}, \quad Z=\sum_{j=1}^{N} e^{x_j}.$$

By taking soft-max, any real-valued inputs are mapped to a probability distribution: every output is positive and the outputs sum to 1
Soft-max is permutation equivariant (permutation of the input leads to same permutation of the output)
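
The two properties above can be checked numerically. This is a minimal sketch in plain Python, not code from the lecture: the input values and the permutation are made up, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import math

def softmax(xs):
    """Return e^{x_i} / Z for each x_i, where Z is the sum of all e^{x_j}."""
    m = max(xs)  # assumption: shift by the max to avoid overflow; the output is unchanged
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, -1.0, 0.5]  # hypothetical inputs
probs = softmax(scores)

# Permutation equivariance: permuting the inputs permutes the outputs the same way.
perm = [2, 0, 1]
permuted_probs = softmax([scores[i] for i in perm])
expected = [probs[i] for i in perm]

print(sum(probs))
print(all(abs(a - b) < 1e-12 for a, b in zip(permuted_probs, expected)))
```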

Selection using Soft-max

After applying soft-max, we get the middle distribution. The green index has a high probability of getting chosen when random sampling.

Soft-max allows us to randomly sample from elements of the original set
This sampling has a distribution depending on the values of the numbers from the original set
- The larger a value is (higher probability), the more likely we will select that value
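
As a rough illustration of this sampling, the sketch below draws indices with soft-max probabilities and counts how often each one comes up. The values, the seed, and the number of draws are arbitrary choices for the example, not part of the notes.

```python
import math
import random

def softmax(xs):
    """Soft-max of a list of numbers (max-shifted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

values = [1.0, 3.0, 0.5, 2.0]  # hypothetical original set
probs = softmax(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
num_draws = 10_000
draws = rng.choices(range(len(values)), weights=probs, k=num_draws)

# Empirical frequency of each index; the largest value should be drawn most often.
counts = [draws.count(i) for i in range(len(values))]
for i, (v, p, c) in enumerate(zip(values, probs, counts)):
    print(f"index {i}: value={v}, probability={p:.3f}, frequency={c / num_draws:.3f}")
```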

Soft-max attention turns dot-product similarities between a query and candidate embeddings into probabilities, then forms a differentiable weighted sum of the candidates

Attention with Soft-max

With a set of vectors $\{u_1,...,u_N\}$ and a “query” vector $q$, we want to select the vector most similar to $q$
- Hard selection does this with $\hat{j} = \arg\max_{j}\; (\,u_j \cdot q\,)$
- The corresponding distribution $p=p_\text{hard}$ has a mass of 1 on $\hat{j}$ and a mass of 0 on all other indices
We want to select a $u_j$ in a way that is differentiable, which is done through soft-max
- Instead, select the vector most similar to $q$ via $p=\text{softmax}(Uq)$
- $U$ is the matrix whose rows are the vectors $u_1,...,u_N$
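
To make the contrast concrete, here is a small sketch comparing hard selection (arg max) with soft selection (soft-max of $Uq$). The vectors in `U` and the query `q` are invented for illustration, and the helper `dot` is just a plain-Python inner product.

```python
import math

def softmax(xs):
    """Soft-max of a list of numbers (max-shifted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

U = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # hypothetical vectors u_1..u_3, one per row
q = [1.0, 0.2]                            # hypothetical query

scores = [dot(u, q) for u in U]           # the entries of Uq

# Hard selection: all mass on the arg max index.
j_hat = max(range(len(U)), key=lambda j: scores[j])
p_hard = [1.0 if j == j_hat else 0.0 for j in range(len(U))]

# Soft selection: every index keeps some mass, the most similar one gets the most.
p_soft = softmax(scores)

print("scores:", scores)
print("p_hard:", p_hard)
print("p_soft:", [round(p, 3) for p in p_soft])
```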

Applying Soft-max to $Uq$

The probability of selecting $u_i$ is the $i$-th entry of the soft-max of the inner products $u_1 \cdot q, \ldots, u_N \cdot q$
This allows us to differentiably select a vector from a set
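
One way to see the differentiability is to write out the derivatives. This is a short derivation added for illustration, assuming scores $x = Uq$ and $p = \text{softmax}(x)$; $\delta_{ij}$ is the Kronecker delta, which is notation not used elsewhere in these notes.

$$\frac{\partial p_i}{\partial x_j} = p_i\,(\delta_{ij} - p_j), \qquad \frac{\partial x_j}{\partial q} = u_j \quad\Longrightarrow\quad \frac{\partial p_i}{\partial q} = \sum_{j=1}^{N} p_i\,(\delta_{ij} - p_j)\,u_j = p_i\left(u_i - \sum_{j=1}^{N} p_j u_j\right).$$

Every $p_i$ is strictly positive, so each selection probability has a well-defined gradient with respect to $q$, whereas the arg max selection is piecewise constant in $q$.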

MLP Soft-max

$q$ here is the last hidden state, and $\{u_1,...,u_N\}$ are the embeddings of the class labels
Samples from this distribution correspond to output labels
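
A toy version of this output layer might look like the sketch below. The label names, class embeddings, and hidden state are all made up for the example; a real model would learn the embeddings and compute the hidden state from its input.

```python
import math
import random

def softmax(xs):
    """Soft-max of a list of numbers (max-shifted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

labels = ["cat", "dog", "bird"]   # hypothetical class labels
class_embeddings = [              # hypothetical u_1..u_3, one row per label
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.9],
]
hidden = [0.1, 1.2, 0.3]          # hypothetical last hidden state, playing the role of q

logits = [dot(u, hidden) for u in class_embeddings]
label_probs = softmax(logits)

# Sampling from the distribution yields an output label.
rng = random.Random(0)
sampled_label = rng.choices(labels, weights=label_probs, k=1)[0]
most_likely_label = labels[label_probs.index(max(label_probs))]

print(dict(zip(labels, (round(p, 3) for p in label_probs))))
print("sampled:", sampled_label, "| most likely:", most_likely_label)
```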

Soft-max Attention

$q$ is the internal hidden state, and $\{u_1,...,u_N\}$ are the embeddings of an “input”, such as the outputs of the previous layer
The distribution gives a summary of $\{u_1,...,u_N\}$, for example the weighted sum $\sum_j p_j u_j$
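
One common reading of "summary" is the probability-weighted sum $\sum_j p_j u_j$. The sketch below computes it under that assumption; the function name `attend` and all numbers are illustrative rather than taken from the lecture.

```python
import math

def softmax(xs):
    """Soft-max of a list of numbers (max-shifted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(U, q):
    """Return the attention weights p = softmax(Uq) and the summary sum_j p_j * u_j."""
    p = softmax([dot(u, q) for u in U])
    dim = len(U[0])
    summary = [sum(p[j] * U[j][d] for j in range(len(U))) for d in range(dim)]
    return p, summary

U = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # hypothetical input embeddings, one per row
q = [1.0, 0.2]                            # hypothetical hidden state

weights, summary = attend(U, q)

# Scaling the query up sharpens the weights, so the summary approaches the best match.
sharp_weights, sharp_summary = attend(U, [50.0, 10.0])

print("weights:", [round(w, 3) for w in weights])
print("summary:", [round(s, 3) for s in summary])
print("sharp summary:", [round(s, 3) for s in sharp_summary])
```

Because the weights are positive and sum to 1, the summary is always a convex combination of the rows of `U`.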

<aside> 📌 SUMMARY: Soft-max allows us to select, in a differentiable manner, among previous layer outputs when they are fed in as input. It is more probable to select the “best” inputs, since soft-max assigns higher probability to inputs with a larger inner product with the query.

</aside>

Date: October 11, 2025

Topic: Attention

Recall

Soft attention allows us to softly select, from a set of vectors $U$, the vector with the highest similarity to the query $q$.

<aside> 📌 SUMMARY: Through soft attention, we can query a context (apply $q$ to a set of vectors $U$) to understand which entry in the set corresponds best to $q$. $q$ is continuously updated as the model keeps querying $U$, until we finally have the answer we are looking for.

</aside>

Date: October 30, 2025

Topic: Attention in RNN