Date: August 31, 2025

Topic: Computation Graph

Recall

Computation graphs allow us to make use of hidden (intermediate) variables. These variables label the intermediate edges in the graph.

When applying the chain rule in the graph, we use local gradients, which exist at each graph node.

Each local gradient is the derivative of a node's output edge with respect to one of its input edges.

Notes

Computation Graph

Models a function as a sequence of intermediate steps

The function is broken down into the simpler steps that compose it
As we move right in the graph, we build up complexity until, at the end, we have the full equation
We can use it to model gradients as they flow through the graph (either forwards or backwards)

Making Use of Intermediate Variables

Computation graphs allow us to make use of hidden (intermediate) variables.
Intermediate variables represent the intermediate edges in the graph (we don’t need labels for input and output edges)
When we calculate gradients in the graph for the chain rule, we generally look at local gradients
- Local gradients exist at each node in the graph
- Each local gradient is usually the derivative of the node's output edge with respect to one of its input edges
  - e.g., in the $\times$ node, the local gradient is $\frac{\partial f}{\partial a}$ and in the $+$ node, the local gradient is $\frac{\partial a}{\partial x}$
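- Multiplying the local gradients along a path gives the overall chain rule product, e.g., $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x}$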
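
As a minimal sketch of how local gradients combine, assume the graph computes $f = a \times z$ with $a = x + y$. This function, the input names `x`, `y`, `z`, and the numeric values are assumptions chosen to match the $\times$ and $+$ nodes mentioned above; the notes do not state the exact function.

```python
# Hypothetical example: f = (x + y) * z, with intermediate variable a = x + y.
x, y, z = 2.0, 3.0, 4.0

# Forward pass: evaluate each node and keep the intermediate value.
a = x + y        # "+" node
f = a * z        # "x" (multiply) node

# Local gradients: derivative of each node's output w.r.t. one of its inputs.
da_dx = 1.0      # d(x + y)/dx
df_da = z        # d(a * z)/da

# Chain rule: multiply the local gradients along the path x -> a -> f.
df_dx = df_da * da_dx

print(f, df_dx)
```

Each local gradient only needs the values at its own node, which is why they can be computed independently and then multiplied together.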

In forward-mode, the gradients are passed forward through the network.

In reverse-mode, the gradients start at the output instead and flow back to the input.

Automatic Differentiation (Autodiff)

Forward-mode Autodiff

Gradients are passed forward as we go through the network
Usually preferred when there is a small number of inputs and a large number of outputs
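
One common way to sketch forward mode is with dual numbers: each value carries its derivative with respect to one chosen input, and both are updated together at every operation. The `Dual` class and the function `g` below are hypothetical names made up for illustration, and only `+` and `*` are supported.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""
    val: float
    dot: float  # d(val)/d(chosen input)

    def __add__(self, other):
        # Sum rule: derivatives add.
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule, applied at the same time as the value is computed.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


def g(x):
    # Hypothetical function with one input and three outputs.
    return [x * x, x * x * x, x + x]


# Seed dot=1.0 on the input: "differentiate with respect to x".
outs = g(Dual(3.0, 1.0))
print([o.val for o in outs])  # values of the three outputs
print([o.dot for o in outs])  # derivatives of all three outputs w.r.t. x
```

A single forward pass gives the derivative of every output with respect to the seeded input, which is why this mode suits few inputs and many outputs.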

Reverse-mode Autodiff

Gradients start from output and flow back to the input
We use this mode for DNN tasks, as the number of inputs (the parameters) is much greater than the number of outputs (typically a single scalar loss).
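
A minimal sketch of reverse mode, assuming a tiny graph that supports only `+` and `*`. The `Var` class and `backward` function are hypothetical names for illustration and are not any library's API.

```python
class Var:
    """A graph node that remembers its parents and the local gradients to them."""

    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents  # tuples of (parent Var, local gradient)
        self.grad = 0.0         # d(output)/d(this node), filled in by backward()

    def __add__(self, other):
        return Var(self.val + other.val, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.val * other.val, ((self, other.val), (other, self.val)))


def backward(out):
    # Topologically order the nodes so that each node is processed
    # only after every node that depends on it.
    order, seen = [], set()

    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)

    visit(out)
    out.grad = 1.0  # d(out)/d(out)
    for v in reversed(order):
        for parent, local in v.parents:
            # Chain rule: upstream gradient times local gradient.
            parent.grad += v.grad * local


x, y, z = Var(2.0), Var(3.0), Var(4.0)
f = (x + y) * z
backward(f)
print(x.grad, y.grad, z.grad)
```

One backward pass from the single output fills in the gradient for every input, which is the property that makes this mode attractive when inputs far outnumber outputs.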

In forward mode, we repeatedly apply the chain rule at each node: the incoming gradient is multiplied by the node's local gradient to give the next forward gradient.

This is useful for small inputs → large outputs, as the Jacobian is built up one input at a time: each forward pass yields the derivatives of all outputs with respect to a single input simultaneously, so few inputs means few passes.
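
The cost trade-off between the two modes can be written in terms of the Jacobian. The notation here ($n$ inputs, $m$ outputs, $J$, and the unit vectors $e_j$, $e_i$) is an assumption added for illustration and is not taken from the notes above.

$$
f : \mathbb{R}^n \to \mathbb{R}^m, \qquad J \in \mathbb{R}^{m \times n}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}
$$

$$
\text{forward mode (one pass, seed } e_j \in \mathbb{R}^n\text{):} \quad J e_j = \text{column } j \text{ of } J
$$

$$
\text{reverse mode (one pass, seed } e_i \in \mathbb{R}^m\text{):} \quad e_i^{\top} J = \text{row } i \text{ of } J
$$

So the full Jacobian takes roughly $n$ forward passes or $m$ reverse passes: forward mode is cheaper when $n \ll m$, and reverse mode is cheaper when $m \ll n$.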

<aside> 📌 SUMMARY: Automatic differentiation represents a function as a DAG of primitive operations and multiplies the local derivatives of those primitives along each path (the chain rule) to calculate gradients.

</aside>