Date: September 19, 2025

Topic: Backwards Pass for Convolution Layer

Recall

For simplicity, assume the output and input have the same shape: the input has been padded by 2 pixels on the bottom and right.

Notes

Backwards Pass for Convolution Layer

Cross-Correlation

Simplifications in calculation:
- 1 channel input
- 1 kernel (i.e., 1 channel output)
- Padding (2 pixels on the right and bottom) so that the output is the same size as the input
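
A minimal sketch of the forward pass under these simplifications. Assumptions not stated in the notes: zero padding, stride 1, NumPy, and a kernel of size $k_1 \times k_2$ (2 pixels of padding corresponds to a $3 \times 3$ kernel). The name `cross_correlate_same` is illustrative only.

```python
import numpy as np

def cross_correlate_same(x, k):
    """Single-channel cross-correlation: y(r, c) = sum_{a,b} x(r+a, c+b) * k(a, b).

    Assumes stride 1 and zero padding on the bottom and right only,
    so that the output has the same shape as the input.
    """
    H, W = x.shape
    k1, k2 = k.shape
    # Pad k1-1 rows at the bottom and k2-1 columns on the right
    # (2 pixels each for a 3x3 kernel).
    xp = np.pad(x, ((0, k1 - 1), (0, k2 - 1)))
    y = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            # No kernel flip: this is cross-correlation, not convolution.
            y[r, c] = np.sum(xp[r:r + k1, c:c + k2] * k)
    return y

x = np.arange(16.0).reshape(4, 4)
k = np.ones((3, 3))
y = cross_correlate_same(x, k)
print(y.shape)  # (4, 4): same size as the input
```

The explicit loops mirror the sliding-window picture; a vectorized version would be faster but hides the indexing.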

Definitions

Output $y$: size $H \times W$; because of the padding, the output and input are the same size
$\frac{\partial L}{\partial y}$: assumed to be $H \times W$ as well (padding added and convention changed for convenience)
$\frac{\partial L}{\partial y(r,c)}$: the derivative with respect to the single output element at row $r$, column $c$

Back-prop Chain Rule

$\text{Downstream Gradient} = \text{Upstream Gradient} \times \text{Local Gradient}$, where, in the backward pass, the upstream gradient arrives from the outputs and the downstream gradient flows on to the inputs

Since the kernel slides across the input image to generate the output image, we need to incorporate all upstream gradients by summing over the gradients of the entire output image.

Due to weight sharing, the same kernel element $k(a',b')$ is used at every spatial location, so its gradient must accumulate contributions from all output locations.

This lets us calculate the weight updates.

Gradients wrt. Weights

This can be done one kernel element at a time → ${\partial L}/{\partial k(a,b)}$, e.g., $(a,b)=(0,0)$
Due to weight sharing, a single weight influences every output pixel, and therefore the loss through all of them
- Initially, it multiplies input $x$ at $(0,0)$. After the kernel slides one step (e.g., stride = 1), it multiplies input $x$ at $(0,1)$, and so on.
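
Written out for the top-left weight (a sketch, assuming stride 1 and the indexing $y(r,c) = \sum_{a,b} x(r+a, c+b)\,k(a,b)$, which these notes do not spell out):

$$
\begin{aligned}
y(0,0) &= x(0,0)\,k(0,0) + \dots \\
y(0,1) &= x(0,1)\,k(0,0) + \dots \\
y(r,c) &= x(r,c)\,k(0,0) + \dots
\end{aligned}
$$

So $k(0,0)$ appears once in every output pixel, each time paired with a different input pixel.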

Chain Rule over All Output Pixels

As each kernel value (e.g., the top-left kernel value) affects every output pixel, we need to incorporate all upstream gradients
This can be achieved via the chain rule, summed over every output pixel
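
In symbols, with the sum running over all $H \times W$ output positions (0-based indexing is an assumption here):

$$
\frac{\partial L}{\partial k(a',b')} = \sum_{r=0}^{H-1} \sum_{c=0}^{W-1} \frac{\partial L}{\partial y(r,c)} \, \frac{\partial y(r,c)}{\partial k(a',b')}
$$

The first factor is the upstream gradient at pixel $(r,c)$; the second is the local gradient of that pixel wrt. the kernel weight.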

Calculating $\frac{\partial y(r,c)}{\partial k(a',b')}$ for a Specific Pixel

This is the term we need to compute: the partial derivative of output pixel $(r,c)$ wrt. the kernel weight at $(a',b')$
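
A short derivation, assuming the forward pass is the stride-1 cross-correlation below with a $k_1 \times k_2$ kernel (the symbols $k_1$, $k_2$ are introduced here for illustration). Only the term with $(a,b) = (a',b')$ depends on $k(a',b')$, so every other term vanishes:

$$
y(r,c) = \sum_{a=0}^{k_1-1} \sum_{b=0}^{k_2-1} x(r+a,\, c+b)\, k(a,b)
\quad\Longrightarrow\quad
\frac{\partial y(r,c)}{\partial k(a',b')} = x(r+a',\, c+b')
$$

The local gradient is simply the input pixel that sat under weight $(a',b')$ when output pixel $(r,c)$ was computed.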

Calculating $\frac{\partial L}{\partial k(a',b')}$

Hence, to get the total gradient for the kernel weight at $(a',b')$, we sum the per-pixel contributions over the entire output image
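
Substituting the local gradient $x(r+a', c+b')$ into the sum gives $\frac{\partial L}{\partial k(a',b')} = \sum_{r}\sum_{c} \frac{\partial L}{\partial y(r,c)}\, x(r+a', c+b')$, which is itself a cross-correlation between the upstream gradient and the input. A minimal NumPy sketch, assuming zero padding on the bottom and right, stride 1, and a surrogate loss $L = \sum_{r,c} \texttt{dy}(r,c)\, y(r,c)$ used only for the finite-difference check. The names `forward` and `kernel_gradient` are illustrative.

```python
import numpy as np

def forward(x, k):
    """Cross-correlation with zero padding on the bottom and right (same-size output)."""
    H, W = x.shape
    k1, k2 = k.shape
    xp = np.pad(x, ((0, k1 - 1), (0, k2 - 1)))
    y = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            y[r, c] = np.sum(xp[r:r + k1, c:c + k2] * k)
    return y

def kernel_gradient(x, dy, kernel_shape):
    """dL/dk(a', b') = sum over (r, c) of dL/dy(r, c) * x(r + a', c + b')."""
    H, W = x.shape
    k1, k2 = kernel_shape
    xp = np.pad(x, ((0, k1 - 1), (0, k2 - 1)))
    dk = np.zeros((k1, k2))
    for a in range(k1):
        for b in range(k2):
            # H x W window of the padded input that kernel element (a, b)
            # multiplies as the kernel visits every output location.
            dk[a, b] = np.sum(dy * xp[a:a + H, b:b + W])
    return dk

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6))
k = rng.normal(size=(3, 3))
dy = rng.normal(size=(5, 6))  # stand-in for the upstream gradient dL/dy

dk = kernel_gradient(x, dy, k.shape)

# Central finite differences on the surrogate loss L = sum(dy * y).
eps = 1e-6
dk_num = np.zeros_like(k)
for a in range(k.shape[0]):
    for b in range(k.shape[1]):
        kp, km = k.copy(), k.copy()
        kp[a, b] += eps
        km[a, b] -= eps
        dk_num[a, b] = np.sum(dy * (forward(x, kp) - forward(x, km))) / (2 * eps)

print(np.allclose(dk, dk_num, atol=1e-5))  # the two gradients should agree
```

For $(a',b') = (0,0)$ the window is the unpadded input itself, so `dk[0, 0]` reduces to `np.sum(dy * x)`, matching the top-left example above.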

<aside> 📌 SUMMARY: The backward pass is a convolution if the forward pass is a cross-correlation.

</aside>
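
One way to see where the convolution in the summary comes from is the gradient wrt. the input, which these notes do not derive. This is a sketch under the same assumptions (stride 1, forward pass $y(r,c) = \sum_{a,b} x(r+a, c+b)\,k(a,b)$, out-of-range gradient terms treated as zero):

$$
\frac{\partial L}{\partial x(r',c')} = \sum_{r}\sum_{c} \frac{\partial L}{\partial y(r,c)}\, k(r'-r,\, c'-c) = \sum_{a}\sum_{b} \frac{\partial L}{\partial y(r'-a,\, c'-b)}\, k(a,b)
$$

The indices are subtracted rather than added, which is the form of a convolution of the upstream gradient with the kernel. The gradient wrt. the weights, by contrast, keeps the added indices of a cross-correlation.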