Date: October 11, 2025
Topic: Soft-max and Preview of Attention
Recall
Soft-max allows us to do random sampling from the input set, with probabilities based on the values of the input set.
Notes
Soft-max
- Attention: Weighting or probability distribution over inputs depending on computational state and inputs
- Allows information to propagate directly between “distant” computational nodes with minimal structural assumptions
- Standard forms of attention are implemented with soft-max
Review of Soft-max
$$
\operatorname{softmax}\!\left(\{x_1,\ldots,x_N\}\right)
= \left\{\frac{e^{x_1}}{Z},\,\ldots,\,\frac{e^{x_N}}{Z}\right\},
\quad
Z=\sum_{j=1}^{N} e^{x_j}.
$$
- For any real-valued inputs, soft-max yields a probability distribution: every entry is positive and the entries sum to 1
- Soft-max is permutation equivariant (permutation of the input leads to same permutation of the output)
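The two properties above can be checked numerically. The sketch below is a minimal pure-Python version of the formula; the helper name `softmax` and the example values are illustrative, and subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the notes (it cancels in the ratio, so the result is unchanged).

```python
import math

def softmax(xs):
    """Soft-max of a list of floats: e^{x_i} / Z with Z = sum_j e^{x_j}."""
    m = max(xs)  # assumption: stability shift, cancels between numerator and Z
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative input set (hypothetical values).
x = [2.0, 0.5, -1.0]
p = softmax(x)

# Permute the input and apply soft-max again.
perm = [2, 0, 1]
x_perm = [x[i] for i in perm]
p_perm = softmax(x_perm)

# Property 1: the output sums to 1.
total = sum(p)

# Property 2: permuting the input permutes the output the same way.
equivariant = all(abs(p_perm[k] - p[perm[k]]) < 1e-12 for k in range(len(x)))
```

Here `total` should equal 1 up to floating-point error, and `equivariant` should be `True`.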
Selection using Soft-max

After applying soft-max, we get the middle distribution. The green index has a high probability of getting chosen when random sampling.
- Soft-max allows us to randomly sample from elements of the original set
- This sampling has a distribution depending on the values of the numbers from the original set
- The larger a value is (higher probability), the more likely we will select that value
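As a rough illustration of the sampling described above, the sketch below draws indices from the soft-max distribution many times and counts how often each one is chosen. The function names, the example values, and the fixed seed are assumptions made for the example only.

```python
import math
import random

def softmax(xs):
    """Soft-max of a list of floats (max-shifted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_index(xs, rng):
    """Sample one index of xs with probability given by softmax(xs)."""
    probs = softmax(xs)
    return rng.choices(range(len(xs)), weights=probs, k=1)[0]

rng = random.Random(0)     # fixed seed so the run is repeatable
values = [1.0, 3.0, 0.0]   # hypothetical input set
counts = [0] * len(values)
for _ in range(10_000):
    counts[sample_index(values, rng)] += 1

# The index holding the largest value (index 1) is selected most often,
# and the index holding the smallest value (index 2) least often.
```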
Soft-max attention turns dot-product similarities between a query and candidate embeddings into probabilities, then forms a differentiable weighted sum of the candidates.
Attention with Soft-max
- With a set of vectors $\{u_1,...,u_N\}$ and a “query” vector $q$, we select the most similar vector to $q$
- This is done with $\hat{j} = \arg\max_{j}\; (\,u_j \cdot q\,)$
- The corresponding distribution $p=p_\text{hard}$ puts all of its mass on $\hat{j}$ and a mass of 0 on every other index
- The $\arg\max$ is not differentiable, so we want to select a $u_j$ in a differentiable way; soft-max provides this
- Instead, select the vector most similar to $q$ via $p=\text{softmax}(Uq)$
- $U$ is the matrix whose rows are the vectors $u_1,...,u_N$
Applying Soft-max to $Uq$

- The probability of selecting $u_i$ is the $i$-th entry of the soft-max of the inner products $u_1 \cdot q, \ldots, u_N \cdot q$
- This allows us to differentiably select a vector from a set
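To contrast the hard ($\arg\max$) selection with the soft-max version, here is a minimal pure-Python sketch. The helper names (`dot`, `hard_select`, `soft_select`) and the example vectors are hypothetical; the weighted sum follows the description of soft-max attention as a weighted sum of the candidates.

```python
import math

def softmax(xs):
    """Soft-max of a list of floats (max-shifted for numerical stability)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def hard_select(U, q):
    """Hard selection: index and vector maximizing u_j . q."""
    scores = [dot(u, q) for u in U]
    j_hat = max(range(len(U)), key=lambda j: scores[j])
    return j_hat, U[j_hat]

def soft_select(U, q):
    """Soft selection: p = softmax(Uq) and the weighted sum of the rows of U."""
    scores = [dot(u, q) for u in U]  # entries of the product Uq
    p = softmax(scores)
    dim = len(U[0])
    summary = [sum(p[j] * U[j][d] for j in range(len(U))) for d in range(dim)]
    return p, summary

# Hypothetical candidates (rows of U) and query.
U = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
q = [2.0, 1.0]

j_hat, u_hard = hard_select(U, q)
p, u_soft = soft_select(U, q)
```

In this example the index with the largest probability in `p` coincides with `j_hat`, but `u_soft` blends all three rows rather than copying a single one, which is what makes the selection differentiable in $q$ and $U$.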
MLP Soft-max
- $q$ here is the last hidden state, and $\{u_1,...,u_N\}$ are the embeddings of the class labels
- Samples from this distribution correspond to labeling (outputs)
Soft-max Attention
- $q$ is the internal hidden state, and $\{u_1,...,u_N\}$ are the embeddings of an “input”, such as the outputs of the previous layer
- The distribution is used to form a summary of $\{u_1,...,u_N\}$: the weighted sum $\sum_j p_j u_j$
<aside>
📌 SUMMARY:
Soft-max lets us select, in a differentiable manner, among the previous layer's outputs when they are fed in as input.
Because soft-max assigns higher probability to larger scores, the “best” inputs are the most likely to be selected.
</aside>
Date: October 11, 2025
Topic: Attention
Recall
Soft attention lets us differentiably select, from a set of vectors $U$, the vector most similar to the query $q$.
<aside>
📌 SUMMARY:
Through soft attention, we can query a context (apply $q$ to a set of vectors $U$) to find which entry in the set corresponds best to $q$.
$q$ is continuously updated as the model keeps querying $U$, until we finally have the answer we are looking for.
</aside>
Date: October 30, 2025
Topic: Attention in RNN