April 23, 2019

# Backpropagation and Gradient Descent

## Gradient Descent

Gradient descent is an optimization algorithm used to train neural networks. Its purpose is to find the values of the parameters **W** and **b** that minimize the **loss function**.

With the code below we update the value of **W**:

`W = W - learning_rate * dW`

We need the derivative of the loss with respect to **W**, written **dW**, and a **hyperparameter** called the **learning rate**.

When we train a neural network we apply this update rule many times to minimize the **loss function**.

If the loss function value increases when **W** increases, the algorithm will decrease the value of **W**, whereas if the loss function value decreases when **W** increases, the algorithm will increase the value of **W**.

If the loss increases as **W** grows then **dW** is positive, and since we subtract the product of the (positive) learning rate and **dW**, we end up decreasing the value of **W**:

`W = W - 0.003 * 0.04`

If the loss decreases as **W** grows then **dW** is negative, so subtracting the product of the learning rate and **dW** increases the value of **W**:

`W = W - 0.003 * -0.06`

The **learning rate** is a **hyperparameter** that controls how much the value of **W** increases or decreases at each step. We compute the derivative **dW** to know how a change in **W** affects the final value of the loss function.

We can illustrate the gradient descent algorithm as follows:

The yellow star indicates the position where the loss function reaches its minimum value, and the colored points trace the path that **W** has to follow to reach this position.

We repeat this process for the parameter **b** as well.
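Putting the pieces together, the sketch below runs this update rule on a toy squared-error loss, with a single data point chosen so that the derivatives are easy to write by hand (for the real network the derivatives come from backpropagation, which the next section covers):

```python
# toy example: fit W * x + b to a single point (x, y) with squared error
x, y = 2.0, 1.0
W, b = 5.0, 3.0
learning_rate = 0.003

for step in range(10000):
    pred = W * x + b
    loss = (pred - y) ** 2
    # derivatives of the loss with respect to W and b
    dW = 2 * (pred - y) * x
    db = 2 * (pred - y)
    # the update rule from above, applied to both parameters
    W = W - learning_rate * dW
    b = b - learning_rate * db

print(W, b, loss)  # the loss shrinks towards 0 as training progresses
```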

## Backpropagation

As we have seen in the previous section, we need the derivatives of the loss with respect to **W** and **b** to perform the **gradient descent** algorithm.

In Python we use the code below to compute the derivatives of a neural network with one hidden layer and the sigmoid activation function. Usually we call this the backpropagation step:

`dZ2 = A2 - Y`

`dW2 = (dZ2 * A1)`

`db2 = dZ2`

`dZ1 = W2 * dZ2 * primesigmoid(Z1)`

`dW1 = dZ1 * X`

`db1 = dZ1`
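To make these six lines concrete, below is a minimal runnable sketch of the forward pass, the backward pass, and one gradient descent update in NumPy. The network sizes and random data are assumptions for illustration, and the matrix form adds the transposes and the 1/m batch average that the simplified lines above leave out:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def primesigmoid(z):
    # derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1 - s)

# assumed sizes: 3 input features, 4 hidden units, 1 output, m = 5 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
m = X.shape[1]
learning_rate = 0.003

# forward pass
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backward pass: the six lines above, in matrix form
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * primesigmoid(Z1)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient descent update
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
```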

In the following section I will explain the maths behind this algorithm.

### Derivatives

Before getting to the algorithm we need to know some concepts about derivatives:

**Derivative**

A derivative measures the change of the function value **(in this case the loss function)** with respect to a change in its argument (in this case **W1, W2** and **b1, b2**).

**Partial derivative**

We usually compute the derivative of a function with respect to its single variable, usually **x**. However, if the function has two or more variables, we compute the derivative with respect to one of them while the remaining variables are held constant; we call this a partial derivative.
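For example, take the function **f(x, y) = x * y + x^2**. The partial derivative with respect to **x** treats **y** as a constant:

`df/dx = y + 2x`

while the partial derivative with respect to **y** treats **x** as a constant:

`df/dy = x`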

**Chain rule**

This rule helps us when we want to compute the derivative of a function with respect to a variable that is buried deep inside a composition of functions. In our case, if we want the derivatives of the loss function with respect to **W1, W2, b1, b2**, we need to chain the functions L, a2, z2, a1, z1 to create a path between **W1, W2, b1, b2** and the loss function, so we compute the derivatives of each of these functions and multiply them together.
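As a quick example, if **f(u) = u^2** and **u = 3x**, the chain rule gives:

`df/dx = df/du * du/dx = 2u * 3 = 2(3x) * 3 = 18x`

which matches differentiating **f = (3x)^2 = 9x^2** directly.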

To represent a derivative we will use the notation **dx**; for example, **dL/da** means the derivative of **L** with respect to **a**.

### The neural network

`Z1 = W1 * x + b1`

`a1 = sigmoid(Z1)`

`Z2 = W2 * a1 + b2`

`a2 = sigmoid(Z2)`

`loss = L(a2, y)`

### Loss function derivative (L)

As the name suggests, in the backpropagation algorithm we start by computing the derivative of the last function, in this case the loss function (the binary cross-entropy):

`L = -(y ln(a2) + (1 - y) ln(1 - a2))`
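A sketch of this loss in Python, assuming NumPy:

```python
import numpy as np

def L(a2, y):
    # binary cross-entropy: small when the prediction a2 is close to the label y
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
```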

Here we have a partial derivative, since the function has two variables **(a2, y)**; we want the derivative of this function with respect to **a2**, which we write **dL/da2**.

First we compute the derivative of the first term **-y ln(a2)**.

The derivative of **ln(u)** is **1/u**, and since *y* multiplies **ln(a2)** we keep it as a constant factor:

`-y * 1/a2`

We know that **y * 1/a2** is the same as **y/a2**, so at the end we have:

`-y/a2`

The second term is **-(1 - y) ln(1 - a2)**. Again the derivative of **ln(u)** is **1/u**, but the chain rule also gives a factor of **-1**, the derivative of **(1 - a2)** with respect to **a2**, so the two minus signs cancel:

`(1 - y)/(1 - a2)`

The final result is:

`dL/da2 = -y/a2 + (1 - y)/(1 - a2)`

### Second sigmoid function derivative

This is the hardest derivative in the neural network. We know the sigmoid function:

`1/(1 + e^-z)`

The derivative of this function is:

`da2/dZ2 = a2(1 - a2)`

If you want a more complete explanation of this derivative you can read this post.

We know that:

`a2 = sigmoid(Z2)`

In the backpropagation code this derivative would be `primesigmoid(Z2)`, although it never appears explicitly: as we will see next, it gets folded into `dZ2 = A2 - Y`.
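A quick way to gain confidence in this result is to compare it with a numerical derivative; a small sketch with assumed values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
a = sigmoid(z)

analytic = a * (1 - a)                                       # a(1 - a)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
print(analytic, numeric)  # both print approximately 0.2217
```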

### Second linear function derivative (Z2)

Here we need to apply the chain rule and join the two previous derivatives:

`dL/dZ2 = dL/da2 * da2/dZ2`

`dL/dZ2 = (-y/a2 + (1 - y)/(1 - a2)) * a2(1 - a2)`

Multiplying through, the first term gives **-y(1 - a2)** and the second gives **(1 - y)a2**; when we add them the cross terms cancel and the result is:

`a2 - y`

Now we have the first derivative that appears in the backpropagation code:

`dZ2 = A2 - Y`
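This simplification can be checked numerically as well; a small sketch with assumed values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def L(z2, y):
    a2 = sigmoid(z2)
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

z2, y, eps = 0.3, 1.0, 1e-6
analytic = sigmoid(z2) - y                               # a2 - y
numeric = (L(z2 + eps, y) - L(z2 - eps, y)) / (2 * eps)  # finite difference
print(analytic, numeric)  # both print approximately -0.4256
```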

### W2 Derivative

Here we have another partial derivative, **dZ2/dW2**, but in this case the function has three variables **(W2, a1, b2)**.

`Z2 = W2 * a1 + b2`

We will compute the derivative of **Z2** with respect to **W2**, so **a1** and **b2** are held constant:

The derivative of the constant term **b2** is 0:

`W2 * a1 + 0`

The derivative of a constant multiplied by a variable is just the constant, so:

`dZ2/dW2 = a1`

Now we want to know how a change in **W2** affects the final value of the loss function **L**, therefore we need **dL/dW2**; we can apply the chain rule to compute this derivative.

Since **dL/dZ2** is already connected to the loss function, we only need to multiply **dZ2/dW2** by **dL/dZ2**:

`dL/dW2 = dZ2/dW2 * dL/dZ2`

`dL/dW2 = a1 * (a2 - y)`

In the backpropagation code we have:

`dW2 = (dZ2 * A1)`

And we know that:

`dZ2 = A2 - Y`
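In the vectorized NumPy sketch shown earlier this product becomes a matrix multiplication averaged over the batch; assuming `dZ2` has shape (1, m) and `A1` has shape (n1, m):

```python
dW2 = (dZ2 @ A1.T) / m  # averages a1 * (a2 - y) over the m examples
```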

### b2 Derivative

We have the same function:

`Z2 = W2 * a1 + b2`

But in this case we want the derivative of **Z2** with respect to **b2**, that is **dZ2/db2**, therefore now **W2** and **a1** are the constants:

The derivative of the constant term **W2 * a1** is 0:

`0 * a1 + b2`

`0 * a1 = 0`

`0 + b2`

And the derivative of a variable with respect to itself is 1:

`dZ2/db2 = 1`

Now we want to know how a change in **b2** affects the loss function value, therefore we compute **dL/db2**. Again we need the chain rule:

`dL/db2 = dZ2/db2 * dL/dZ2`

`dL/db2 = 1 * (a2 - y)`

`dL/db2 = (a2 - y)`

In the backpropagation code it appears like:

`db2 = dZ2`

And we know that:

`dZ2 = A2 - Y`
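In the vectorized sketch, the scalar result `db2 = dZ2` becomes a sum over the batch; assuming `dZ2` has shape (1, m):

```python
db2 = np.sum(dZ2, axis=1, keepdims=True) / m  # averages (a2 - y) over the batch
```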

### First sigmoid function derivative

This derivative is the same as the one for the second sigmoid function, but in this case the input is **Z1** and the output is **a1**:

`da1/dZ1 = a1(1 - a1)`

### First linear function derivative (Z1)

The first lineal function is:

`Z1 = W1 * x + b1`

We use the chain rule to compute **dL/dZ1**. The path from **Z1** to the loss goes through **a1** and **Z2**, so we also need **dZ2/da1**: since **Z2 = W2 * a1 + b2**, the derivative of **Z2** with respect to **a1** is **W2**. Chaining everything together:

`dL/dZ1 = dL/dZ2 * dZ2/da1 * da1/dZ1`

`dL/dZ1 = (a2 - y) * W2 * a1(1 - a1)`

In the backpropagation code it is:

`dZ1 = W2 * dZ2 * primesigmoid(Z1)`
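In the vectorized sketch this line picks up a transpose, because **W2** has to be flipped to carry the gradient backwards through the layer; assuming `W2` has shape (1, n1) and `dZ2` has shape (1, m):

```python
dZ1 = (W2.T @ dZ2) * primesigmoid(Z1)  # shape (n1, m)
```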

### W1 Derivative

The process is the same as for **dW2**, but now the function is:

`Z1 = W1 * x + b1`

First we compute **dZ1/dW1**. The derivative of the constant term **b1** is 0:

`W1 * x + 0`

And the derivative of a constant multiplied by a variable is the constant:

`dZ1/dW1 = x`

Now we need to know how a change in **W1** affects the loss function value, so we compute **dL/dW1**:

`dL/dW1 = dL/dZ1 * dZ1/dW1`

`dL/dW1 = ((a2 - y) * W2 * a1(1 - a1)) * x`

In the backpropagation code:

`dW1 = dZ1 * X`

### b1 Derivative

Again the process is the same as for **db2**:

`Z1 = W1 * x + b1`

`dZ1/db1 = 1`

We apply the chain rule to compute **dL/db1**:

`dL/db1 = dZ1/db1 * dL/dZ1`

`dL/db1 = 1 * (a2 - y) * W2 * a1(1 - a1)`

In the backpropagation code:

`db1 = dZ1`