April 23, 2019

# Backpropagation and Gradient Descent

## Gradient Descent

Gradient descent is an optimization algorithm used to train neural networks. Its purpose is to find the values of the parameters **W** and **b** that minimize the **loss function**.

With the code below we update the value of **W**:

`W = W - learning_rate * dW`

We need the derivative of the loss with respect to **W**, written **dW**, and a **hyperparameter** called the **learning rate**.

When we train a neural network we apply this update rule many times to minimize the **loss function**.

If the loss function value increases when **W** increases, the algorithm will decrease the value of **W**, whereas if the loss function value decreases when **W** increases, the algorithm will increase the value of **W**.

If the loss increases as **W** grows then **dW** is positive, and since we subtract the product of the (positive) learning rate and **dW**, we end up decreasing the value of **W**:

`W = W - 0.003 * 0.04`

If the loss decreases as **W** grows then **dW** is negative, so subtracting the product of the learning rate and **dW** increases the value of **W**:

`W = W - 0.003 * -0.06`

The **learning rate** is a **hyperparameter** that controls how much the value of **W** increases or decreases at each step. We compute the derivative **dW** to know how a change in **W** affects the final value of the loss function.

We can illustrate the gradient descent algorithm as follows:

The yellow star indicates the position where the loss function reaches its minimum value, and the colored points trace the path that **W** has to follow to reach this position.

We repeat this process for the parameter **b** as well.
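Putting the pieces together, the sketch below runs this update rule on a toy squared-error loss, with a single data point chosen so that the derivatives are easy to write by hand (for the real network the derivatives come from backpropagation, which the next section covers):

```python
# toy example: fit W * x + b to a single point (x, y) with squared error
x, y = 2.0, 1.0
W, b = 5.0, 3.0
learning_rate = 0.003

for step in range(10000):
    pred = W * x + b
    loss = (pred - y) ** 2
    # derivatives of the loss with respect to W and b
    dW = 2 * (pred - y) * x
    db = 2 * (pred - y)
    # the update rule from above, applied to both parameters
    W = W - learning_rate * dW
    b = b - learning_rate * db

print(W, b, loss)  # the loss shrinks towards 0 as training progresses
```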

## Backpropagation

As we have seen in the previous section, we need the derivatives of the loss with respect to **W** and **b** to perform the **gradient descent** algorithm.

In Python we use the code below to compute the derivatives of a neural network with one hidden layer and the sigmoid activation function. Usually we call this the backpropagation step:

`dZ2 = A2 - Y`

`dW2 = (dZ2 * A1)`

`db2 = dZ2`

`dZ1 = W2 * dZ2 * primesigmoid(Z1)`

`dW1 = dZ1 * X`

`db1 = dZ1`
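To make these six lines concrete, below is a minimal runnable sketch of the forward pass, the backward pass, and one gradient descent update in NumPy. The network sizes and random data are assumptions for illustration, and the matrix form adds the transposes and the 1/m batch average that the simplified lines above leave out:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def primesigmoid(z):
    # derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1 - s)

# assumed sizes: 3 input features, 4 hidden units, 1 output, m = 5 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
m = X.shape[1]
learning_rate = 0.003

# forward pass
Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# backward pass: the six lines above, in matrix form
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * primesigmoid(Z1)
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# gradient descent update
W2 = W2 - learning_rate * dW2
b2 = b2 - learning_rate * db2
W1 = W1 - learning_rate * dW1
b1 = b1 - learning_rate * db1
```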

In the following section I will explain the maths behind this algorithm.

### Derivatives

Before getting to the algorithm we need to know some concepts about derivatives:

**Derivative**

A derivative measures the change of the function value **(in this case the loss function)** with respect to a change in its argument (in this case **W1, W2** and **b1, b2**).

**Partial derivative**

We usually compute the derivative of a function with respect to its single variable, usually **x**. However, if the function has two or more variables, we compute the derivative with respect to one of them while the remaining variables are held constant; we call this a partial derivative.
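For example, take the function **f(x, y) = x * y + x^2**. The partial derivative with respect to **x** treats **y** as a constant:

`df/dx = y + 2x`

while the partial derivative with respect to **y** treats **x** as a constant:

`df/dy = x`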

**Chain rule**

This rule helps us when we want to compute the derivative of a function with respect to a variable that is buried deep inside a composition of functions. In our case, if we want the derivatives of the loss function with respect to **W1, W2, b1, b2**, we need to chain the functions L, a2, z2, a1, z1 to create a path between **W1, W2, b1, b2** and the loss function, so we compute the derivatives of each of these functions and multiply them together.
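As a quick example, if **f(u) = u^2** and **u = 3x**, the chain rule gives:

`df/dx = df/du * du/dx = 2u * 3 = 2(3x) * 3 = 18x`

which matches differentiating **f = (3x)^2 = 9x^2** directly.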

To represent a derivative we will use the notation **dx**; for example, **dL/da** means the derivative of **L** with respect to **a**.

### The neural network

`Z1 = W1 * x + b1`

`a1 = sigmoid(Z1)`

`Z2 = W2 * a1 + b2`

`a2 = sigmoid(Z2)`

`loss = L(a2, y)`

### Loss function derivative (L)

As the name suggests, in the backpropagation algorithm we start by computing the derivative of the last function, in this case the loss function (the binary cross-entropy):

`L = -(y ln(a2) + (1 - y) ln(1 - a2))`
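A sketch of this loss in Python, assuming NumPy:

```python
import numpy as np

def L(a2, y):
    # binary cross-entropy: small when the prediction a2 is close to the label y
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
```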

Here we have a partial derivative, since the function has two variables **(a2, y)**; we want the derivative of this function with respect to **a2**, which we write **dL/da2**.

First we compute the derivative of the first term **-y ln(a2)**.

The derivative of **ln(u)** is **1/u**, and since *y* multiplies **ln(a2)** we keep it as a constant factor:

`-y * 1/a2`

We know that **y * 1/a2** is the same as **y/a2**, so at the end we have:

`-y/a2`

The second term is **-(1 - y) ln(1 - a2)**. Again the derivative of **ln(u)** is **1/u**, but the chain rule also gives a factor of **-1**, the derivative of **(1 - a2)** with respect to **a2**, so the two minus signs cancel:

`(1 - y)/(1 - a2)`

The final result is:

`dL/da2 = -y/a2 + (1 - y)/(1 - a2)`

### Second sigmoid function derivative

This is the hardest derivative in the neural network. We know the sigmoid function:

`1/(1 + e^-z)`

The derivative of this function is:

`da2/dZ2 = a2(1 - a2)`

If you want a more complete explanation of this derivative you can read this post.

We know that:

`a2 = sigmoid(Z2)`

In the backpropagation code this derivative would be `primesigmoid(Z2)`, although it never appears explicitly: as we will see next, it gets folded into `dZ2 = A2 - Y`.
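A quick way to gain confidence in this result is to compare it with a numerical derivative; a small sketch with assumed values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
a = sigmoid(z)

analytic = a * (1 - a)                                       # a(1 - a)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
print(analytic, numeric)  # both print approximately 0.2217
```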

### Second linear function derivative (Z2)

Here we need to apply the chain rule and join the two previous derivatives:

`dL/dZ2 = dL/da2 * da2/dZ2`

`dL/dZ2 = (-y/a2 + (1 - y)/(1 - a2)) * a2(1 - a2)`

Multiplying through, the first term gives **-y(1 - a2)** and the second gives **(1 - y)a2**; when we add them the cross terms cancel and the result is:

`a2 - y`

Now we have the first derivative that appears in the backpropagation code:

`dZ2 = A2 - Y`
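This simplification can be checked numerically as well; a small sketch with assumed values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def L(z2, y):
    a2 = sigmoid(z2)
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

z2, y, eps = 0.3, 1.0, 1e-6
analytic = sigmoid(z2) - y                               # a2 - y
numeric = (L(z2 + eps, y) - L(z2 - eps, y)) / (2 * eps)  # finite difference
print(analytic, numeric)  # both print approximately -0.4256
```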

### W2 Derivative

Here we have another partial derivative, **dZ2/dW2**, but in this case the function has three variables **(W2, a1, b2)**.

`Z2 = W2 * a1 + b2`

We will compute the derivative of **Z2** with respect to **W2**, so **a1** and **b2** are held constant:

The derivative of the constant term **b2** is 0:

`W2 * a1 + 0`

The derivative of a constant multiplied by a variable is just the constant, so:

`dZ2/dW2 = a1`

Now we want to know how a change in **W2** affects the final value of the loss function **L**, therefore we need **dL/dW2**; we can apply the chain rule to compute this derivative.

Since **dL/dZ2** is already connected to the loss function, we only need to multiply **dZ2/dW2** by **dL/dZ2**:

`dL/dW2 = dZ2/dW2 * dL/dZ2`

`dL/dW2 = a1 * (a2 - y)`

In the backpropagation code we have:

`dW2 = (dZ2 * A1)`

And we know that:

`dZ2 = A2 - Y`
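In the vectorized NumPy sketch shown earlier this product becomes a matrix multiplication averaged over the batch; assuming `dZ2` has shape (1, m) and `A1` has shape (n1, m):

```python
dW2 = (dZ2 @ A1.T) / m  # averages a1 * (a2 - y) over the m examples
```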

### b2 Derivative

We have the same function:

`Z2 = W2 * a1 + b2`

But in this case we want the derivative of **Z2** with respect to **b2**, that is **dZ2/db2**, therefore now **W2** and **a1** are the constants:

The derivative of the constant term **W2 * a1** is 0:

`0 * a1 + b2`

`0 * a1 = 0`

`0 + b2`

And the derivative of a variable with respect to itself is 1:

`dZ2/db2 = 1`

Now we want to know how a change in **b2** affects the loss function value, therefore we compute **dL/db2**. Again we need the chain rule:

`dL/db2 = dZ2/db2 * dL/dZ2`

`dL/db2 = 1 * (a2 - y)`

`dL/db2 = (a2 - y)`

In the backpropagation code it appears like:

`db2 = dZ2`

And we know that:

`dZ2 = A2 - Y`
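In the vectorized sketch, the scalar result `db2 = dZ2` becomes a sum over the batch; assuming `dZ2` has shape (1, m):

```python
db2 = np.sum(dZ2, axis=1, keepdims=True) / m  # averages (a2 - y) over the batch
```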

### First sigmoid function derivative

This derivative is the same as the one for the second sigmoid function, but in this case the input is **Z1** and the output is **a1**:

`da1/dZ1 = a1(1 - a1)`

### First linear function derivative (Z1)

The first lineal function is:

`Z1 = W1 * x + b1`

We use the chain rule to compute **dL/dZ1**. The path from **Z1** to the loss goes through **a1** and **Z2**, so we also need **dZ2/da1**: since **Z2 = W2 * a1 + b2**, the derivative of **Z2** with respect to **a1** is **W2**. Chaining everything together:

`dL/dZ1 = dL/dZ2 * dZ2/da1 * da1/dZ1`

`dL/dZ1 = (a2 - y) * W2 * a1(1 - a1)`

In the backpropagation code it is:

`dZ1 = W2 * dZ2 * primesigmoid(Z1)`
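In the vectorized sketch this line picks up a transpose, because **W2** has to be flipped to carry the gradient backwards through the layer; assuming `W2` has shape (1, n1) and `dZ2` has shape (1, m):

```python
dZ1 = (W2.T @ dZ2) * primesigmoid(Z1)  # shape (n1, m)
```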

### W1 Derivative

The process is the same as for **dW2**, but now the function is:

`Z1 = W1 * x + b1`

First we compute **dZ1/dW1**. The derivative of the constant term **b1** is 0:

`W1 * x + 0`

And the derivative of a constant multiplied by a variable is the constant:

`dZ1/dW1 = x`

Now we need to know how a change in **W1** affects the loss function value, so we compute **dL/dW1**:

`dL/dW1 = dL/dZ1 * dZ1/dW1`

`dL/dW1 = ((a2 - y) * W2 * a1(1 - a1)) * x`

In the backpropagation code:

`dW1 = dZ1 * X`

### b1 Derivative

Again the process is the same as for **db2**:

`Z1 = W1 * x + b1`

`dZ1/db1 = 1`

We apply the chain rule to compute **dL/db1**:

`dL/db1 = dZ1/db1 * dL/dZ1`

`dL/db1 = 1 * (a2 - y) * W2 * a1(1 - a1)`

In the backpropagation code:

`db1 = dZ1`