The supervised ML algorithms mentioned so far are widely used and implemented. The algorithms more commonly associated with "general AI" are known as deep learning algorithms. These algorithms are built from neural networks, which loosely resemble neurons in the brain, and they typically require far more computing power than classical ML algorithms, often more than a commercial computer can provide.
Supervised Learning with Non-Linear Models
In the supervised learning setting, our model/hypothesis is $h_\theta (x)$, which we have so far taken to be linear: $h_\theta (x) = \theta^T x$ in linear regression, $h_\theta(x) = g(\theta^T x)$ in logistic regression, or $h_\theta(x) = \theta^T \phi(x)$ in linear regression over new feature variables $\phi(x)$. Next, we will learn about a general family of models that are non-linear in both the parameters $\theta$ and the inputs $x$.
Suppose $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ are the training samples with $y^{(i)} \in \mathbb{R}$. Then, we will try to find some non-linear function $h_\theta: \mathbb{R}^d \longrightarrow \mathbb{R}$ that best fits the data. Note that we don't even know what form this function is in. Just like the linear model, the cost/loss function for point $x^{(i)}$ is really just the square of the residual, scaled by $0.5$.
\[J^{(i)} (\theta) \equiv \frac{1}{2} \big( h_\theta (x^{(i)}) - y^{(i)} \big)^2\]
and the mean-square cost function for the whole dataset is
\[J(\theta) \equiv \frac{1}{n} \sum_{i=1}^n J^{(i)} (\theta)\]
To minimize this, we can use batch GD with update rule
\[\theta = \theta - \alpha \nabla_\theta J(\theta)\]
Computing the gradient of a batch of examples simultaneously can often be faster than computing each example's gradient separately, thanks to hardware parallelization, so a mini-batch version of SGD is most commonly used in deep learning.
In mini-batch Stochastic Gradient Descent, we choose the learning rate (step size) $\alpha$, the batch size $B$, and the number of iterations $n_{iter}$. We initialize $\theta$ randomly and repeat the following $n_{iter}$ times: sample $B$ examples $j_1, \ldots, j_B$ (without replacement) uniformly from $\{1, \ldots, n \}$ and update $\theta$ by
\[\theta = \theta - \frac{\alpha}{B} \sum_{k=1}^B \nabla_\theta J^{(j_k)} (\theta)\]
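As a rough illustration, here is a minimal NumPy sketch of this mini-batch SGD loop. The helper `grad_J_i` (a callable returning $\nabla_\theta J^{(j)}(\theta)$ for a single example $j$) is a hypothetical placeholder supplied by the caller; the function name and defaults are illustrative, not prescribed.

```python
import numpy as np

def minibatch_sgd(theta, grad_J_i, n, alpha=0.01, B=32, n_iter=1000, seed=0):
    """Minimal mini-batch SGD sketch (names and defaults are illustrative).

    theta    : initial parameter vector (np.ndarray)
    grad_J_i : callable(theta, j) -> gradient of the per-example cost J^(j)
    n        : number of training examples
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        # sample B example indices uniformly, without replacement
        batch = rng.choice(n, size=B, replace=False)
        # average the per-example gradients over the mini-batch and step
        g = sum(grad_J_i(theta, j) for j in batch) / B
        theta = theta - alpha * g
    return theta
```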
Activation Functions: ReLU Function
Let us define our first non-linear model. Given the affine map $h_{\omega, b}(x) \equiv \omega^T x + b$, we can compose it with an activation function $\sigma: \mathbb{R} \longrightarrow \mathbb{R}$ (both domain and codomain must be $\mathbb{R}$). One such activation function is the ReLU function
\[\text{ReLU}: \mathbb{R} \longrightarrow \mathbb{R}, \;\;\; \text{ReLU}(z) \equiv \max\{z, 0\}\]
Composing the two gives
\[h_\theta (x) \equiv \text{ReLU}(\omega^T x + b), \;\;\;\;\; \text{ where } \;\; \omega \in \mathbb{R}^d, b \in \mathbb{R}\]
This $\text{ReLU}$ function is used to avoid any negative values in our output (e.g. if we're predicting house price as a function of area and number of bedrooms, it wouldn't make sense to have negative values). Since $\text{ReLU}$ is defined only for scalar values $z$, we extend this definition for vector values $z \in \mathbb{R}^n$ to get
\[\text{ReLU}(z) \equiv \text{ReLU} \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{pmatrix} \equiv \begin{pmatrix} \text{ReLU}(z_1) \\ \text{ReLU}(z_2) \\ \vdots \\ \text{ReLU}(z_n) \end{pmatrix} \]
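A minimal sketch of this element-wise extension in NumPy (the function name `relu` is just illustrative); `np.maximum` already broadcasts over vectors, so the scalar and vector cases are handled by the same line.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: works for scalar z and for vectors z in R^n."""
    return np.maximum(z, 0.0)

relu(-3.0)                          # 0.0
relu(np.array([1.5, -2.0, 0.0]))    # array([1.5, 0. , 0. ])
```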
Other activation functions $\sigma: \mathbb{R} \longrightarrow \mathbb{R}$, which should obviously be non-linear, include the sigmoid and hyperbolic tangent functions:
\begin{align*}
\sigma (z) \equiv \frac{1}{1 + e^{-z}} & \implies h_\theta (x) = \frac{1}{1 + e^{-(\omega^T x + b)}} \\
\sigma (z) \equiv \frac{e^{z} - e^{-z}}{e^z + e^{-z}} & \implies h_\theta (x) = \frac{e^{\omega^T x + b} - e^{-(\omega^T x + b)}}{e^{\omega^T x + b} + e^{-(\omega^T x + b)}}
\end{align*}
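For comparison, here is a small sketch of the sigmoid and hyperbolic tangent activations composed with the affine map $\omega^T x + b$; the concrete values of `w`, `b`, and `x` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # same as (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
    return np.tanh(z)

# composing an activation with the affine map w^T x + b
w, b = np.array([0.5, -1.0]), 0.2
x = np.array([1.0, 2.0])
h_sigmoid = sigmoid(w @ x + b)   # h_theta(x) with sigmoid activation
h_tanh = tanh(w @ x + b)         # h_theta(x) with tanh activation
```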
The reason $\sigma$ must be non-linear is that if it were linear, then, since a composition of linear maps is itself linear, the entire neural network would just be one big composition of linear maps and therefore one big linear map itself. This would lose most of the representational power of the neural network, since the output we are trying to predict generally has a non-linear relationship with the inputs. Without non-linear activation functions, the neural network would simply perform linear regression.
Neural Networks
Since we will be using both notations $\theta$ and $\omega, b$, we should keep in mind that
\[\theta \equiv \begin{pmatrix} \omega \\ b \end{pmatrix} \in \mathbb{R}^d \oplus \mathbb{R}\]
and call $\omega$ the weight vector and $b$ the bias.
A neuron is defined as a function (visualized as a node) that takes in inputs $x$ and outputs a value $y$ computed through a non-linear function (which may, for example, be defined as $h_\theta (x) \equiv \text{ReLU}(\omega^T x + b)$). In the visual below, we see a node $h: \mathbb{R}^3 \longrightarrow \mathbb{R}$, with the nodes representing the inputs/outputs and the three arrows representing the function. An arrow from $x_i$ to $y$ means that $x_i$ is a relevant variable in computing $y$; here the neuron does not consider $x_2$ relevant and does not use it in its function.
If there is no arrow from a potential input $x$ to an output $y$, then $x$ plays no role in computing $y$. However, we usually work with fully-connected neural networks, in which every input contributes to every output, since we usually cannot make assumptions in advance about which variables are relevant.
We can stack multiple neurons so that one neuron passes its output as input into the next, resulting in a more complex function. What we have seen just now is a 1-layer neural network. The fully-connected 2-layer neural network with $d$ input features $x_1, \ldots, x_d$ and one scalar output $y \in \mathbb{R}$ can be visualized below. It has one hidden layer with $m$ values $a_1, a_2, \ldots, a_m$.
Note that each $a_i$ in the hidden layer is computed (here assuming that the mapping $x \mapsto a_i$ is an affine map followed by ReLU, though other non-linear maps could be used) with the following system:
\begin{align*}
a_1 & = \text{ReLU}(\omega_1^T x + b_1) \\
a_2 & = \text{ReLU}(\omega_2^T x + b_2) \\
\ldots & = \ldots \\
a_m & = \text{ReLU}(\omega_m^T x + b_m)
\end{align*}
where each calculation $x \mapsto a_1, \; x \mapsto a_2, \; \ldots, x \mapsto a_m$ can be visualized below.
We can vectorize all this and rewrite it in an equivalent matrix form (to prevent slow for-looping in our process):
\[a = \text{ReLU}\big( W x + b \big) = \text{ReLU} \Bigg(
\begin{pmatrix} — & \omega_1^T & — \\ — & \omega_2^T & — \\ & \vdots & \\ — & \omega_m^T & — \end{pmatrix} \begin{pmatrix}
x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
\Bigg)\]
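A minimal NumPy sketch of this vectorized single layer, assuming $W$ stores the $\omega_i^T$ as rows; the sizes and the name `dense_relu_layer` are illustrative.

```python
import numpy as np

def dense_relu_layer(W, b, x):
    """One fully-connected layer with ReLU: a = ReLU(W x + b).

    W : (m, d) weight matrix whose rows are w_1^T, ..., w_m^T
    b : (m,)   bias vector
    x : (d,)   input features
    """
    return np.maximum(W @ x + b, 0.0)

# hypothetical sizes: d = 3 input features, m = 4 hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = np.array([1.0, 0.5, -2.0])
a = dense_relu_layer(W, b, x)    # shape (4,): one value per hidden unit
```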
Now, let us transition to a multi-layer fully-connected neural network. It has:
- input values $x \in \mathbb{R}^d$.
- an output value $y \in \mathbb{R}$
- $r$ layers (i.e. $r-1$ hidden layers plus the output layer), with
- the zeroth layer being the input vector $x$ itself, of dimension $m_0 = d$.
- the first hidden layer being an $m_1$-dimensional vector $a^{[1]}$, with elements $a_1^{[1]}, a_2^{[1]}, \ldots, a_{m_1}^{[1]}$.
- the second hidden layer being an $m_2$-dimensional vector $a^{[2]}$, with elements $a_1^{[2]}, a_2^{[2]}, \ldots, a_{m_2}^{[2]}$.
- ...
- the $(r-1)$th hidden layer being an $m_{r-1}$-dimensional vector $a^{[r-1]}$, with elements $a_1^{[r-1]}, a_2^{[r-1]}, \ldots, a_{m_{r-1}}^{[r-1]}$.
$W^{[i]}$ is the weight matrix and $b^{[i]}$ is the bias used to transition from layer $i-1$ to layer $i$.
\begin{align*}
x & = a^{[0]} \\
a^{[1]} & = \text{ReLU}\big( W^{[1]} a^{[0]} + b^{[1]} \big) \\
a^{[2]} & = \text{ReLU}\big( W^{[2]} a^{[1]} + b^{[2]} \big) \\
\ldots & = \ldots \\
a^{[r-1]} & = \text{ReLU}\big( W^{[r-1]} a^{[r-2]} + b^{[r-1]} \big) \\
a^{[r]} & = W^{[r]} a^{[r-1]} + b^{[r]} \\
y & = a^{[r]}
\end{align*}
The total number of neurons in the network is simply the sum of the dimensions of each vector representing each layer:
\[\sum_{i=1}^r m_i\]
and the total number of parameters in this network is the number of elements in the $r$ pairs of weight matrices and biases $\big( W^{[i]}, b^{[i]}\big)$:
\[(d+1) m_1 + (m_1 + 1) m_2 + \ldots + (m_{r-1} + 1) m_r\]
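A short sketch of this parameter count, assuming the layer widths are listed as $[d, m_1, \ldots, m_r]$; the helper name `count_parameters` is made up for illustration.

```python
def count_parameters(layer_dims):
    """Number of parameters of a fully-connected net with widths [d, m_1, ..., m_r].

    Each layer i contributes (m_{i-1} + 1) * m_i entries for (W^[i], b^[i]).
    """
    return sum((m_prev + 1) * m
               for m_prev, m in zip(layer_dims[:-1], layer_dims[1:]))

# example: d = 3 inputs, one hidden layer of 4 units, scalar output
count_parameters([3, 4, 1])   # (3+1)*4 + (4+1)*1 = 21
# the total number of neurons (excluding the inputs) is sum(layer_dims[1:]) = 5
```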
Backpropagation for 2-Layer Neural Networks
When it comes to finding gradients, we could derive them easily for simple linear regression, since the hypothesis function $h_\theta$ and its cost function $J(\theta)$ are simple. Our neural network also attempts to fit some hypothesis function $h_\theta (x)$ given training samples
\[\big\{(x^{(i)}, y^{(i)}) \; \big|\; x^{(i)} \in \mathbb{R}^d, \; y^{(i)} \in \mathbb{R} \big\}_{i=1}^n\]
We can do this by minimizing the cost function
\[J(\theta) \equiv \frac{1}{n} \sum_{i=1}^n J^{(i)} (\theta) \equiv \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \big( h_\theta (x^{(i)}) - y^{(i)} \big)^2\]
For a neural network, however, $h_\theta (x)$ is a very complicated composition of non-linear functions with many parameters. More precisely, an $r$-layer fully-connected neural network with layer dimensions $d, m_1, m_2, \ldots, m_{r-1}$ (as in the diagram above) will have
\[(d+1) m_1 + (m_1 + 1) m_2 + \ldots + (m_{r-1} + 1) m_r\]
parameters, which can all be encoded in the weight matrices and biases $W^{[1]}, W^{[2]}, \ldots, W^{[r]}$ and $b^{[1]}, b^{[2]}, \ldots, b^{[r]}$. So, optimizing the parameters $\theta$ in a neural network is really equivalent to optimizing the $W^{[i]}$'s and $b^{[i]}$'s. Therefore, the challenge is now to find some efficient way to calculate the gradient of this complex function $h_\theta$, which may have thousands, or even millions, of parameters. Fortunately, we can use
backpropagation to compute this gradient and then use it in a gradient descent algorithm to minimize $J$. Note that backpropagation and GD are not the same (although the terms are sometimes used interchangeably)! Backpropagation is just a method to compute a gradient, whereas GD uses that gradient to move $\theta$ in the direction of steepest descent. One can say that backpropagation is a subroutine of GD.
We consider the 2-layer neural network mentioned before, with $d$ input features and one hidden layer of $m$ units. We can interpret the calculation of the cost function $J$ as the following composition of mappings:
\[x \xrightarrow{W^{[1]} x + b^{[1]}} z \xrightarrow{\text{ReLU}(z)} a \xrightarrow{W^{[2]} a + b^{[2]}} o \xrightarrow{\frac{1}{2} (y - o)^2} J\]
Recall the multivariate chain rule: Let $f: \mathbb{R}^n \longrightarrow \mathbb{R}^m$ and $g: \mathbb{R}^p \longrightarrow \mathbb{R}^n$ be two functions such that $f \circ g: \mathbb{R}^p \longrightarrow \mathbb{R}^m$ is well-defined. Then,
\[\underset{m \times p}{D (f \circ g)} = \underset{m \times n}{D f} \cdot \underset{n \times p}{D g}\]
From this we can state
Lemma 1: Given some smooth $J: \mathbb{R}^m \longrightarrow \mathbb{R}$ with $z \in \mathbb{R}^m, W \in \mathbb{R}^{m \times d}, u \in \mathbb{R}^d, b \in \mathbb{R}^m$ such that
\begin{align*}
z & = W u + b \\
J & = J(z)
\end{align*}
then
\[\underset{m \times d}{\frac{\partial J}{\partial W}} = \underset{m \times 1}{\frac{\partial J}{\partial z}} \cdot \underset{1 \times d}{u^T} \;\;\;\;\;\;\; \text{ and } \;\;\;\;\;\; \underset{m \times 1}{\frac{\partial J}{\partial b}} = \underset{m \times 1}{\frac{\partial J}{\partial z}}\]
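As a quick numerical sanity check of Lemma 1, we can pick a concrete (assumed) smooth cost $J(z) = \frac{1}{2}\|z - y\|^2$, so that $\partial J/\partial z = z - y$, and compare one entry of the analytic $\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z} \, u^T$ against a finite difference; the sizes and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 3, 4
W, b = rng.normal(size=(m, d)), rng.normal(size=m)
u, y = rng.normal(size=d), rng.normal(size=m)

def J_of_W(W_):
    z = W_ @ u + b
    return 0.5 * np.sum((z - y) ** 2)   # the assumed concrete J(z)

# Lemma 1 says dJ/dW = (dJ/dz) u^T; here dJ/dz = z - y
z = W @ u + b
analytic = np.outer(z - y, u)           # shape (m, d)

# finite-difference check of a single entry, e.g. W[1, 2]
eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps
numeric = (J_of_W(W_pert) - J_of_W(W)) / eps
assert abs(numeric - analytic[1, 2]) < 1e-4
```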
To calculate the partials of $J$ with respect to $W^{[2]}$ and $b^{[2]}$, we can take the function $W^{[2]} \mapsto J$ and split it up into the composition of the two functions:
\begin{align*}
o & = W^{[2]} a + b^{[2]} \\
J & = \frac{1}{2} \big(y - o\big)^2
\end{align*}
and directly apply lemma 1 to get
\begin{align*}
\frac{\partial J}{\partial W^{[2]}} & = \frac{\partial J}{\partial o} \cdot \frac{\partial o}{\partial W^{[2]}} = (o - y) \cdot a^T \\
\frac{\partial J}{\partial b^{[2]}} & = \frac{\partial J}{\partial o} = (o - y)
\end{align*}
We now introduce
Lemma 2: Given some smooth $J: \mathbb{R}^m \longrightarrow \mathbb{R}$ with $z, a \in \mathbb{R}^m$ and activation function $\sigma: \mathbb{R} \longrightarrow \mathbb{R}$ such that
\begin{align*}
a & = \sigma(z) \;\;\;\; (\text{ where } \sigma \text{ is an element-wise action})\\
J & = J(a)
\end{align*}
Then, we have (where $\odot$ denotes component-wise product):
\[\frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot \sigma^\prime (z) = \begin{pmatrix} \partial J/ \partial a_1 \\ \vdots \\ \partial J/ \partial a_m \end{pmatrix} \odot \begin{pmatrix} \sigma^\prime (z_1) \\ \vdots \\ \sigma^\prime (z_m) \end{pmatrix} = \begin{pmatrix}
\frac{\partial J}{\partial a_1} \cdot \sigma^\prime (z_1) \\ \vdots \\ \frac{\partial J}{\partial a_m} \cdot \sigma^\prime (z_m) \end{pmatrix}\]
Now using these two lemmas, we can find the partials of $J$ with respect to $W^{[1]}$ and $b^{[1]}$. By splitting up the composition $x \mapsto J$ into $x \mapsto z \mapsto J$ as below and using lemma 1, we get
\[\begin{cases} z = W^{[1]} x + b^{[1]} \\
J = J(z) \end{cases} \implies \begin{cases}
\frac{\partial J}{\partial W^{[1]}} = \frac{\partial J}{\partial z} \cdot x^T \\
\frac{\partial J}{\partial b^{[1]}} = \frac{\partial J}{\partial z} \end{cases}\]
Now we just have to find $\partial J/\partial z$. By splitting up $z \mapsto J$ into $z \mapsto a \mapsto J$ as below and using lemma 2, we get
\[\begin{cases} a = \sigma(z) \\ J = J(a) \end{cases} \implies \frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot \sigma^\prime (z)\]
Combining these two expressions gives
\begin{align*}
\frac{\partial J}{\partial W^{[1]}} & = \frac{\partial J}{\partial z} \cdot x^T = \bigg( \frac{\partial J}{\partial a} \odot \sigma^\prime (z) \bigg) \cdot x^T \\
\frac{\partial J}{\partial b^{[1]}} & = \frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} \odot \sigma^\prime (z)
\end{align*}
By finally splitting up $a \mapsto J$ into $a \mapsto o \mapsto J$, we get
\[\begin{cases}
o = W^{[2]} a + b^{[2]} \\
J = \frac{1}{2} (y - o)^2 \end{cases} \implies \frac{\partial J}{\partial a} = \frac{\partial J}{\partial o} \; \frac{\partial o}{\partial a} = (o - y)\, \big(W^{[2]}\big)^T \]
Therefore,
\begin{align*}
\frac{\partial J}{\partial W^{[1]}} & = \bigg( \frac{\partial J}{\partial a} \odot \sigma^\prime (z) \bigg) \cdot x^T \\
& = \bigg( (o - y) \big(W^{[2]}\big)^T \odot \sigma^\prime (z) \bigg) \cdot x^T = (o - y) \; \begin{pmatrix}
W_1^{[2]} \cdot 1\{z_1 \geq 0\} \\ \vdots \\ W_m^{[2]} \cdot 1\{z_m \geq 0\}
\end{pmatrix} \cdot x^T \\
\frac{\partial J}{\partial b^{[1]}} & = \frac{\partial J}{\partial a} \odot \sigma^\prime (z) \\
& = (o - y) \big(W^{[2]}\big)^T \odot \sigma^\prime (z) = (o - y) \; \begin{pmatrix}
W_1^{[2]} \cdot 1\{z_1 \geq 0\} \\ \vdots \\ W_m^{[2]} \cdot 1\{z_m \geq 0\}
\end{pmatrix}
\end{align*}
These four partials
\[\frac{\partial J}{\partial W^{[1]}}, \frac{\partial J}{\partial W^{[2]}}, \frac{\partial J}{\partial b^{[1]}}, \frac{\partial J}{\partial b^{[2]}}\]
are all we need to construct the gradient. Therefore, given that we have a neural network (with a given $\sigma$) with some training sample $(x, y)$ and are initialized at some point $\big(W^{[1]}, W^{[2]}, b^{[1]}, b^{[2]}\big)$ in the parameter space, we can take all these (given) points/values and compute $z = W^{[1]} x + b^{[1]}$, then $a = \sigma (z)$, then $o = W^{[2]} a + b^{[2]}$. These can be used to then compute
\begin{align*}
\frac{\partial J}{\partial W^{[1]}} & = (o - y) \, \Big( (W^{[2]})^T \odot \sigma^\prime (z) \Big) \cdot x^T \\
\frac{\partial J}{\partial W^{[2]}} & = (o - y) \cdot a^T \\
\frac{\partial J}{\partial b^{[1]}} & = (o - y) \, (W^{[2]})^T \odot \sigma^\prime (z) \\
\frac{\partial J}{\partial b^{[2]}} & = (o - y)
\end{align*}
Computing these four partials basically computes the gradient of $J$ at the point $\big(W^{[1]}, W^{[2]}, b^{[1]}, b^{[2]}\big)$, and so traveling in the direction of this vector:
\[- \nabla J(\theta) \equiv -\begin{pmatrix} \partial J/\partial W^{[1]} \\ \partial J/\partial W^{[2]} \\ \partial J/\partial b^{[1]} \\ \partial J/\partial b^{[2]} \end{pmatrix} \in \mathbb{R}^{md} \oplus \mathbb{R}^m \oplus \mathbb{R}^m \oplus \mathbb{R}\]
is traveling in the direction of steepest descent. Note that $md$, $m$, $m$, and $1$ are the dimensions of $W^{[1]}, W^{[2]}, b^{[1]}, b^{[2]}$, respectively.
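A minimal sketch of this forward pass and the four partials for the 2-layer network, assuming $\sigma = \text{ReLU}$ (so $\sigma^\prime(z) = 1\{z \geq 0\}$) and the shapes $W^{[1]} \in \mathbb{R}^{m \times d}$, $W^{[2]} \in \mathbb{R}^{1 \times m}$; the function name and shape conventions are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_grads(W1, b1, W2, b2, x, y):
    """Forward pass and backprop for the 2-layer net above (scalar output).

    Shapes: W1 (m, d), b1 (m,), W2 (1, m), b2 scalar; x (d,), y scalar.
    Returns the four partials of J = 0.5 * (o - y)^2.
    """
    # forward pass
    z = W1 @ x + b1              # (m,)
    a = relu(z)                  # (m,)
    o = (W2 @ a + b2).item()     # scalar output

    # backward pass: the four partials derived above
    dJ_do = o - y
    dJ_dW2 = dJ_do * a[None, :]              # (o - y) a^T, shape (1, m)
    dJ_db2 = dJ_do                           # (o - y)
    dJ_dz = dJ_do * W2.ravel() * (z >= 0)    # (o - y) * W2^T (elementwise) * 1{z >= 0}
    dJ_dW1 = np.outer(dJ_dz, x)              # (dJ/dz) x^T, shape (m, d)
    dJ_db1 = dJ_dz
    return dJ_dW1, dJ_db1, dJ_dW2, dJ_db2
```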
Backpropagation for Multi-Layer Neural Networks
We now generalize the backpropagation algorithm to multi-layer neural networks. With the notation $a^{[0]} = x$ and $a^{[r]} = z^{[r]} = h_\theta (x)$ for simplicity, we have
\begin{align*}
a^{[1]} & = \text{ReLU}\big(W^{[1]} a^{[0]} + b^{[1]} \big) \\
a^{[2]} & = \text{ReLU} \big( W^{[2]} a^{[1]} + b^{[2]} \big) \\
\ldots & = \ldots \\
a^{[r-1]} & = \text{ReLU} \big( W^{[r-1]} a^{[r-2]} + b^{[r-1]} \big) \\
a^{[r]} & = z^{[r]} = W^{[r]} a^{[r-1]} + b^{[r]} \\
J & = \frac{1}{2} \big( a^{[r]} - y \big)^2
\end{align*}
This entire computation from one point, $x \rightarrow h_\theta (x) \rightarrow J$, can be broken down into the following steps. Note that the $J$ at the end does not represent the mean-square cost function (over all data points) but just the cost for the single point $(x, y)$.
\begin{align*}
x & \xrightarrow{W^{[1]} x + b^{[1]}} z^{[1]} \xrightarrow{\text{ReLU}(z^{[1]})} a^{[1]} \\
& \xrightarrow{W^{[2]} a^{[1]} + b^{[2]}} z^{[2]} \xrightarrow{\text{ReLU}(z^{[2]})} a^{[2]}\\
& \xrightarrow{W^{[3]} a^{[2]} + b^{[3]}} z^{[3]} \xrightarrow{\text{ReLU}(z^{[3]})} a^{[3]}\\
& \ldots \\
& \xrightarrow{W^{[r-2]} a^{[r-3]} + b^{[r-2]}} z^{[r-2]} \xrightarrow{\text{ReLU}(z^{[r-2]})} a^{[r-2]}\\
& \xrightarrow{W^{[r-1]} a^{[r-2]} + b^{[r-1]}} z^{[r-1]} \xrightarrow{\text{ReLU}(z^{[r-1]})} a^{[r-1]}\\
& \xrightarrow{W^{[r]} a^{[r-1]} + b^{[r]}} z^{[r]} \xrightarrow{a^{[r]} = z^{[r]}} a^{[r]}\\
& \xrightarrow{\frac{1}{2} (y - a^{[r]})^2} J
\end{align*}
But note that we can write the mappings $a^{[k-1]} \mapsto z^{[k]}$ and $z^{[k]} \mapsto J$ as on the left of the implication below, which by lemma 1 gives
\[\begin{cases}
z^{[k]} = W^{[k]} a^{[k-1]} + b^{[k]} \\
J = J(z^{[k]}) \end{cases} \implies \begin{cases}
\frac{\partial J}{\partial W^{[k]}} = \frac{\partial J}{\partial z^{[k]}} \cdot a^{[k-1]\, T} \\
\frac{\partial J}{\partial b^{[k]}} = \frac{\partial J}{\partial z^{[k]}}
\end{cases} \;\; \text{ for all } k = 1, 2, \ldots, r\]
Therefore, computing all the $\partial J / \partial z^{[k]}$'s will allow us to compute the above for all $k$, allowing us to get the gradient. The initial term where $k=r$ can be solved easily:
\[J(z^{[r]}) = \frac{1}{2} (y - z^{[r]})^2 \implies \frac{\partial J}{\partial z^{[r]}} = \big(z^{[r]} - y \big)\]
What we would like to do is design some formula that calculates the $\frac{\partial J}{\partial z^{[k]}}$'s iteratively backwards from $k = r, r-1, \ldots, 2, 1$. This is often called the
backward pass. By lemma 2, we can take the mappings $z^{[k]} \mapsto a^{[k]}$, and get
\[\begin{cases}
a^{[k]} = \text{ReLU}(z^{[k]}) \\
J = J(a^{[k]})
\end{cases} \implies
\frac{\partial J}{\partial z^{[k]}} = \frac{\partial J}{\partial a^{[k]}} \odot \text{ReLU}^\prime (z^{[k]})\]
Now, we must find $\frac{\partial J}{\partial a^{[k]}}$, which we express in terms of $\frac{\partial J}{\partial z^{[k+1]}}$. The relationship between $a^{[k]}$ and $z^{[k+1]}$ can be written as below, and applying the chain rule gives:
\[\begin{cases}
z^{[k+1]} = W^{[k+1]} a^{[k]} + b^{[k+1]} \\
J = J (z^{[k+1]})
\end{cases} \implies \underset{m_k \times 1}{\frac{\partial J}{\partial a^{[k]}}} = \underset{m_k \times m_{k+1}}{W^{[k+1]\,T}} \, \underset{m_{k+1} \times 1}{\frac{\partial J}{\partial z^{[k+1]}}}\]
To see why we are taking the transpose of $W^{[k+1]}$, recall that given a function $f: \mathbb{R}^n \longrightarrow \mathbb{R}$, its total derivative at a point $x_0$ is a linear mapping $df (x_0): \mathbb{R}^n \longrightarrow \mathbb{R}$ that is the
best linear approximation of $f$ at $x_0$. Stepping away from the abstraction, the matrix realization of $df(x_0)$ is a $1 \times n$ matrix (i.e. a row vector) that we will call $Df(x_0)$.
\[Df(x_0) = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \ldots & \frac{\partial f}{\partial x_n} \end{pmatrix}\]
We write it as a row vector because the action of $Df(x_0)$ onto a certain point $x^*$ is written in terms of left-matrix multiplication.
\[Df(x_0) (x^*) = Df(x_0) x^* = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \ldots & \frac{\partial f}{\partial x_n} \end{pmatrix} \begin{pmatrix} x^*_1 \\ \vdots \\ x^*_n \end{pmatrix}\]
Similarly, a function $f: \mathbb{R}^n \longrightarrow \mathbb{R}^m$ has the total derivative $Df(x_0): \mathbb{R}^n \longrightarrow \mathbb{R}^m$, with a realization as a $m \times n$ matrix. Now, given the equations above, we can see that through the chain rule,
\[\underset{1 \times m_k} {\frac{\partial J}{\partial a^{[k]}}} = \underset{1 \times m_{k+1}}{\frac{\partial J}{\partial z^{[k+1]}}} \,
\underset{m_{k+1} \times m_k}{\frac{\partial z^{[k+1]}}{\partial a^{[k]}}} \implies \frac{\partial J}{\partial a^{[k]}} = \frac{\partial J}{\partial z^{[k+1]}} \, W^{[k+1]}\]
but within this context of deep learning, we have used
column vectors to represent derivatives, so taking the transpose of the entire equation gives
\[\Bigg( \frac{\partial J}{\partial a^{[k]}} \Bigg)^T =\Bigg(\frac{\partial J}{\partial z^{[k+1]}} \, W^{[k+1]}\Bigg)^T \implies \Bigg( \frac{\partial J}{\partial a^{[k]}} \Bigg)^T = W^{[k+1]\,T} \Bigg(\frac{\partial J}{\partial z^{[k+1]}}\Bigg)^T\]
Now back to the main problem: we are done! Starting from $k=r$, we first compute
\begin{align*}
\frac{\partial J}{\partial W^{[r]}} & \, = \frac{\partial J}{\partial z^{[r]}} a^{[r-1]\,T} = \big( z^{[r]} - y\big) a^{[r-1]\,T} \\
\frac{\partial J}{\partial b^{[r]}} & = \frac{\partial J}{\partial z^{[r]}} \;\;\;\;\;\;\;\;\;\;\; = \big( z^{[r]} - y\big)
\end{align*}
and then use the iterative algorithm to compute for $k=r-1, \ldots, 2, 1$.
\begin{align*}
\frac{\partial J}{\partial W^{[k]}} & \, = \frac{\partial J}{\partial z^{[k]}} a^{[k-1]\,T} = \bigg[ \bigg( W^{[k+1]\,T} \, \frac{\partial J}{\partial z^{[k+1]}} \bigg) \odot \text{ReLU}^\prime \big( z^{[k]} \big) \bigg] a^{[k-1]\,T} \\
\frac{\partial J}{\partial b^{[k]}} & = \frac{\partial J}{\partial z^{[k]}} \;\;\;\;\;\;\;\;\;\;\; = \bigg( W^{[k+1]\,T} \, \frac{\partial J}{\partial z^{[k+1]}} \bigg) \odot \text{ReLU}^\prime \big( z^{[k]} \big)
\end{align*}
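A sketch of this backward pass as a loop over layers, assuming ReLU activations in layers $1, \ldots, r-1$ and a linear output layer; `Ws` and `bs` are hypothetical lists holding $W^{[1]}, \ldots, W^{[r]}$ and $b^{[1]}, \ldots, b^{[r]}$ (so index $k$ in the code corresponds to layer $k+1$ in the notes).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def backprop_single_example(Ws, bs, x, y):
    """Per-example backprop for an r-layer ReLU network with scalar output.

    Ws = [W1, ..., Wr], bs = [b1, ..., br]; layers 1..r-1 use ReLU,
    layer r is linear. Returns lists of the partials of J = 0.5*(a_r - y)^2.
    """
    r = len(Ws)

    # forward pass: cache every z^[k] and a^[k]
    a, zs, acts = x, [], [x]
    for k in range(r):
        z = Ws[k] @ a + bs[k]
        a = z if k == r - 1 else relu(z)    # last layer has no activation
        zs.append(z)
        acts.append(a)

    # backward pass: delta holds dJ/dz^[k]
    delta = zs[-1] - y                       # dJ/dz^[r] = z^[r] - y
    dWs, dbs = [None] * r, [None] * r
    for k in reversed(range(r)):             # code index k <-> layer k+1 in the notes
        dWs[k] = np.outer(delta, acts[k])    # dJ/dW = (dJ/dz) a_prev^T
        dbs[k] = delta                       # dJ/db = dJ/dz
        if k > 0:
            # dJ/dz_prev = (W^T dJ/dz) elementwise ReLU'(z_prev)
            delta = (Ws[k].T @ delta) * (zs[k - 1] >= 0)
    return dWs, dbs
```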
Summary: Backpropagation of Multi-Layer Neural Nets (with Vectorization)
We summarize the entire backpropagation algorithm. Given that we have an $r$-layer neural network (with a given $\sigma$) $h_\theta: \mathcal{X} = \mathbb{R}^d \longrightarrow \mathcal{Y} = \mathbb{R}$ and a set of $n$ $d$-dimensional training samples
\[\big\{ (x^{(i)}, y^{(i)}) \big\}_{i=1}^n\]
we would like to find the optimal parameters
\[\theta \equiv \big( W^{[1]}, W^{[2]}, \ldots, W^{[r]}, \; b^{[1]}, b^{[2]}, \ldots, b^{[r]}\big) \]
such that the mean-square cost function
\[J(\theta) \equiv \frac{1}{n} \sum_{i=1}^n J^{(i)}(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \big( h_\theta(x^{(i)}) - y^{(i)} \big)^2\]
is minimized. We can optimize $\theta$ using gradient descent, and fortunately, we can indeed efficiently calculate the gradient of this much more complex, non-linear model. We will write the gradient in the form
\[\nabla_\theta J \equiv \begin{pmatrix}
\partial J / \partial W^{[1]} \\
\partial J / \partial W^{[2]} \\
\vdots \\
\partial J / \partial W^{[r]} \\
\partial J / \partial b^{[1]} \\
\vdots \\
\partial J / \partial b^{[r]}
\end{pmatrix}\]
Note that the backpropagation algorithm operates on a single training sample $(x^{(i)}, y^{(i)})$ at a time (as opposed to the full set of $n$), with cost function $J^{(i)} (\theta) \equiv \frac{1}{2} \big( h_\theta (x^{(i)}) - y^{(i)}\big)^2$ (without the summation). Therefore, keep in mind that backpropagation calculates $\nabla_\theta J^{(i)}$ for each $i$, not $\nabla_\theta J$ directly. However, it is an elementary fact of calculus that the gradient of $J$ is equal to the average of the gradients of all the $J^{(i)}$'s, so it suffices to calculate each $\nabla_\theta J^{(i)}$.
For each $x^{(i)}$, we can compute all the $z^{[j](i)}$'s and $a^{[j](i)}$'s to get $h_\theta (x^{(i)}) = a^{[r](i)}$ and calculate (using $y^{(i)}$) $J^{(i)}$, too.
\begin{align*}
x^{(i)} & \xrightarrow{W^{[1]} x^{(i)} + b^{[1]}} z^{[1](i)} \xrightarrow{\sigma(z^{[1](i)})} a^{[1](i)} \\
& \xrightarrow{W^{[2]} a^{[1](i)} + b^{[2]}} z^{[2](i)} \xrightarrow{\sigma(z^{[2](i)})} a^{[2](i)}\\
& \ldots \\
& \xrightarrow{W^{[r-1]} a^{[r-2](i)} + b^{[r-1]}} z^{[r-1](i)} \xrightarrow{\sigma(z^{[r-1](i)})} a^{[r-1](i)}\\
& \xrightarrow{W^{[r]} a^{[r-1](i)} + b^{[r]}} z^{[r](i)} \xrightarrow{a^{[r](i)} = z^{[r](i)}} a^{[r](i)}\\
& \xrightarrow{\frac{1}{2} (y^{(i)} - a^{[r](i)})^2} J^{(i)} \;\;\;\;\; \text{ for all } i = 1, 2, \ldots, n
\end{align*}
Commencing the backward pass, we first compute for $k=r$
\begin{align*}
\frac{\partial J^{(i)}}{\partial W^{[r]}} = \big(z^{[r](i)} - y^{(i)} \big) a^{[r-1](i)\,T} & \implies \frac{\partial J}{\partial W^{[r]}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial J^{(i)}}{\partial W^{[r]}} = \frac{1}{n} \sum_{i=1}^n \big(z^{[r](i)} - y^{(i)} \big) a^{[r-1](i)\,T} \\
\frac{\partial J^{(i)}}{\partial b^{[r]}} = \big(z^{[r](i)} - y^{(i)} \big) \;\;\;\;\;\;\;\;\;\;\;\;\;\;& \implies \frac{\partial J}{\partial b^{[r]}} \;\;\;\; = \frac{1}{n} \sum_{i=1}^n \frac{\partial J^{(i)}}{\partial b^{[r]}} = \frac{1}{n} \sum_{i=1}^n \big(z^{[r](i)} - y^{(i)} \big)
\end{align*}
Then, we continue down for $k=r-1, \ldots, 2, 1$ to compute the partials for the rest of the $W^{[i]}, b^{[i]}$'s.
\begin{align*}
\frac{\partial J^{(i)}}{\partial W^{[k]}} & = \bigg[ \bigg( W^{[k+1]\,T} \, \frac{\partial J^{(i)}}{\partial z^{[k+1](i)}} \bigg) \odot \sigma^\prime (z^{[k](i)}) \bigg] a^{[k-1](i)\,T} \\
& \implies \frac{\partial J}{\partial W^{[k]}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial J^{(i)}}{\partial W^{[k]}} = \frac{1}{n} \sum_{i=1}^n \bigg[ \bigg( W^{[k+1]\,T} \, \frac{\partial J^{(i)}}{\partial z^{[k+1](i)}} \bigg) \odot \sigma^\prime (z^{[k](i)}) \bigg] a^{[k-1](i)\,T} \\
\frac{\partial J^{(i)}}{\partial b^{[k]}} & = \bigg( W^{[k+1]\,T} \, \frac{\partial J^{(i)}}{\partial z^{[k+1](i)}} \bigg) \odot \sigma^\prime (z^{[k](i)}) \\
& \implies \frac{\partial J}{\partial b^{[k]}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial J^{(i)}}{\partial b^{[k]}} = \frac{1}{n} \sum_{i=1}^n \bigg( W^{[k+1]\,T} \, \frac{\partial J^{(i)}}{\partial z^{[k+1](i)}} \bigg) \odot \sigma^\prime (z^{[k](i)})
\end{align*}
Therefore, we have the gradient, labeled $\nabla_\theta J$, which is itself a function that takes in inputs $W^{[1]}, W^{[2]}, \ldots, W^{[r]}, b^{[1]}, \ldots, b^{[r]}$ and outputs a vector in the parameter space of $\theta$ pointing in the direction of steepest increase of $J$. Using GD, we can initialize a random $\theta$ in the parameter space and use the iterative algorithm
\[\theta = \theta - \alpha \nabla_\theta J (\theta)\]
In words, given that we are at a position $\theta$ describing some set of parameters of the neural network, $-\nabla_\theta J (\theta)$ gives the direction we should travel in to decrease $J$ as quickly as possible. We update $\theta$ to $\theta - \alpha \nabla_\theta J (\theta)$ and repeat until convergence.
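Putting the pieces together, here is a full-batch gradient descent sketch that averages the per-example gradients and then takes a step of size $\alpha$; it assumes the `backprop_single_example` helper sketched earlier, and the defaults are illustrative.

```python
import numpy as np

def gradient_descent(Ws, bs, X, Y, alpha=0.01, n_iter=500):
    """Full-batch GD sketch: average per-example backprop gradients, then step.

    X : sequence of inputs x^(i); Y : sequence of scalar targets y^(i).
    Relies on backprop_single_example from the earlier sketch.
    """
    n = len(X)
    for _ in range(n_iter):
        sum_dWs = [np.zeros_like(W) for W in Ws]
        sum_dbs = [np.zeros_like(b) for b in bs]
        for x, y in zip(X, Y):
            dWs, dbs = backprop_single_example(Ws, bs, x, y)
            for k in range(len(Ws)):
                sum_dWs[k] += dWs[k]
                sum_dbs[k] += dbs[k]
        # the gradient of J is the average of the per-example gradients
        Ws = [W - alpha * dW / n for W, dW in zip(Ws, sum_dWs)]
        bs = [b - alpha * db / n for b, db in zip(bs, sum_dbs)]
    return Ws, bs
```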
Vectorization over Training Samples
Rather than computing everything independently for each sample $(x^{(i)}, y^{(i)})$, we can process the samples in batches using basic matrix properties. That is, given the following system, with $z_i, b \in \mathbb{R}^m$, $x_i \in \mathbb{R}^d$, and $W \in \text{Mat}(m \times d, \mathbb{R})$,
\begin{align*}
z_1 & = W x_1 + b \\
z_2 & = W x_2 + b \\
\ldots & = \ldots \\
z_n & = W x_n + b
\end{align*}
we can encode it as the matrix equation
\[Z = W X + \tilde{b} \iff \begin{pmatrix} | & \ldots & | \\ z_1 & \ldots & z_n \\ | & \ldots & | \end{pmatrix} = W \begin{pmatrix}
| & \ldots & | \\
x_1 & \ldots & x_n \\
| & \ldots & |
\end{pmatrix} + \begin{pmatrix} | & \ldots & | \\ b & \ldots & b \\ | & \ldots & | \end{pmatrix} \]
and therefore, by setting the inputs and outputs as
\[X = \begin{pmatrix} | & \ldots & | \\ x^{(1)} & \ldots & x^{(n)} \\ | & \ldots & | \end{pmatrix} \in \text{Mat}(d \times n, \mathbb{R}), \; Y = \begin{pmatrix} y^{(1)} & \ldots & y^{(n)} \end{pmatrix} \in \text{Mat}(1 \times n, \mathbb{R})\]
along with the intermediate values (note that $m_0 = d, m_r = 1$) as
\[Z^{[i]} = \begin{pmatrix} | & \ldots & |\\ z^{[i](1)} & \ldots & z^{[i](n)} \\ | & \ldots & |\end{pmatrix}, \; A^{[i]} = \begin{pmatrix} | & \ldots & | \\ a^{[i](1)} & \ldots & a^{[i](n)} \\ | & \ldots & | \end{pmatrix}, \; \tilde{b}^{[i]} = \begin{pmatrix} | & \ldots & | \\ b^{[i]} & \ldots & b^{[i]} \\ | & \ldots & | \end{pmatrix} \in \text{Mat}(m_i \times n, \mathbb{R})\]
and mappings as
\[W^{[i]} \in \text{Mat}(m_{i} \times m_{i-1}, \mathbb{R}), \;\;\;\;
\sigma (Z^{[i]}) = \begin{pmatrix} | & \ldots & |\\ \sigma(z^{[i](1)}) & \ldots & \sigma(z^{[i](n)}) \\ | & \ldots & |\end{pmatrix} = \begin{pmatrix} \sigma(z^{[i](1)}_1) & \ldots & \sigma(z^{[i](n)}_1) \\
\vdots & \ddots & \vdots \\
\sigma(z^{[i](1)}_{m_i}) & \ldots & \sigma(z^{[i](n)}_{m_i}) \end{pmatrix}\]
each of the $n$ series of compositions of mappings for each $x^{(i)}$ can now be visualized as one series:
\begin{align*}
X & \xrightarrow{W^{[1]} X + \tilde{b}^{[1]}} Z^{[1]} \xrightarrow{\sigma(Z^{[1]})} A^{[1]} \\
& \xrightarrow{W^{[2]} A^{[1]} + \tilde{b}^{[2]}} Z^{[2]} \xrightarrow{\sigma(Z^{[2]})} A^{[2]}\\
& \ldots \\
& \xrightarrow{W^{[r-1]} A^{[r-2]} + \tilde{b}^{[r-1]}} Z^{[r-1]} \xrightarrow{\sigma(Z^{[r-1]})} A^{[r-1]}\\
& \xrightarrow{W^{[r]} A^{[r-1]} + \tilde{b}^{[r]}} Z^{[r]} \xrightarrow{A^{[r]} = Z^{[r]}} A^{[r]}\\
& \xrightarrow{\frac{1}{2n} ||Y - A^{[r]}||^2} J
\end{align*}
Note that $A^{[r]} = Z^{[r]}$ is the vector of predicted values, a $1 \times n$ row vector, since the final map $W^{[r]}$ sends $\mathbb{R}^{m_{r-1}}$ to the reals $\mathbb{R}$. Furthermore, the mean-square cost function can be computed directly (without summing the individual cost functions sample by sample) by taking the squared $L_2$-distance between the given $Y$ and the prediction $A^{[r]}$ and scaling it by $\frac{1}{2n}$. It is the same process as above, but much neater to write down.
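A sketch of this batched forward pass, where NumPy broadcasting plays the role of the tiled bias matrix $\tilde{b}^{[i]}$; the function name and the optional cost computation are illustrative.

```python
import numpy as np

def forward_batch(Ws, bs, X, Y=None):
    """Vectorized forward pass over n samples stacked as columns of X (shape (d, n)).

    Ws, bs : lists of weight matrices and 1-D bias vectors.
    Y      : optional (1, n) row vector of targets, matching A^[r].
    Returns A^[r] (shape (1, n)) and, if Y is given, J = ||Y - A^[r]||^2 / (2n).
    """
    r = len(Ws)
    A = X
    for k in range(r):
        # broadcasting b^[k] across the n columns replaces the tiled matrix b-tilde
        Z = Ws[k] @ A + bs[k][:, None]
        A = Z if k == r - 1 else np.maximum(Z, 0.0)
    if Y is None:
        return A
    n = X.shape[1]
    J = 0.5 / n * np.sum((Y - A) ** 2)
    return A, J
```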