Prerequisites
- Gradient and its main properties.
- Vectors as n \times 1 or 1 \times n matrices.
Introduction
Gradient Descent is a very popular and powerful tool, used when minimizing a function exactly (analytically) is not possible.
- It gives a very good approximation to the answer, not the exact answer.
- It is used to find one of the local minima of a known function f.
- This is very useful when we have to minimize some function f that has parameters w_i which can be varied.
- Trainable inputs are the inputs which can be tweaked to reach the minimum.
- Hyperparameters are inputs which are decided by the user and cannot be trained or varied (for example, the learning rate introduced below).
Method
Let f be a function whose trainable inputs are w_i, \; i \in \{1, 2, \dots, n\}, where n \in \mathbb{N}. The gradient descent algorithm goes as follows:
- We define w = [w_1, w_2, \dots, w_n]^T and initialize it with a random guess.
- We update w as w_{final} = w_{initial} - c \, \nabla f(w_{initial}), where c \in \mathbb{R}, c > 0, is a fixed constant chosen by the user. It is also known as the learning rate of the method.
- We keep repeating step 2 until \nabla f(w_{initial}) becomes small enough that \|w_{final} - w_{initial}\| is smaller than the desired accuracy.
- The same effect can also be achieved by rounding w off to one more significant digit than the required accuracy and stopping whenever w_{initial} = w_{final}. A minimal sketch of this loop is given below.
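The following is a minimal sketch of the method in Python, assuming the user supplies a function grad_f that returns \nabla f(w) as a NumPy array; the function names, the example f(w) = \|w\|^2, and the stopping tolerance are illustrative choices, not part of the original description.

```python
import numpy as np

def gradient_descent(grad_f, w_init, c=0.1, tol=1e-8, max_iters=10_000):
    """Minimize f by repeatedly stepping against its gradient.

    grad_f   : callable returning the gradient of f at w (NumPy array)
    w_init   : initial random guess for w
    c        : learning rate (hyperparameter chosen by the user)
    tol      : stop when ||w_final - w_initial|| falls below this accuracy
    max_iters: safety cap, since convergence is not guaranteed in general
    """
    w = np.asarray(w_init, dtype=float)
    for _ in range(max_iters):
        w_new = w - c * grad_f(w)        # w_final = w_initial - c * grad f(w_initial)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Example: f(w) = w_1^2 + w_2^2, so grad f(w) = 2w and the minimum is at the origin.
w_min = gradient_descent(lambda w: 2 * w, w_init=np.random.randn(2), c=0.1)
print(w_min)  # should be very close to [0, 0]
```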
Proof of convergence
Gradient descent is not guaranteed to converge in general, but in the limit c \to 0 and number of iterations \to \infty, it always converges to a local minimum or a saddle point, or diverges to -\infty.
Proof: Let \hat{n} be the unit vector in the direction of \nabla f(w_{initial}), so that \nabla f(w_{initial}) = |\nabla f(w_{initial})| \, \hat{n}.
Let \Delta w = w_{final} - w_{initial} and \Delta f = f(w_{final}) - f(w_{initial}). From the update rule,
\Delta w = -c \, \nabla f(w_{initial}) = -c \, |\nabla f(w_{initial})| \, \hat{n}
We know that df = \nabla f(w_{initial}) \cdot dr, so for a small step,
\Delta f \approx \nabla f(w_{initial}) \cdot \Delta w = |\nabla f(w_{initial})| \, \hat{n} \cdot \big( -c \, |\nabla f(w_{initial})| \, \hat{n} \big) = -c \, |\nabla f(w_{initial})|^2 \le 0 \quad (\text{as } c > 0)
As \Delta f \le 0, f always decreases unless \nabla f(w_{initial}) = 0.
\therefore f reaches a local minimum, a saddle point, or diverges towards -\infty.
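As a quick numerical sanity check of the inequality above (not part of the original proof), the sketch below compares the actual change \Delta f with the first-order prediction -c \, |\nabla f(w)|^2 for a small learning rate; the test function f(w) = w_1^2 + 3 w_2^2 and the chosen c are illustrative assumptions.

```python
import numpy as np

# Test function and its gradient (illustrative choices).
f = lambda w: w[0]**2 + 3 * w[1]**2
grad_f = lambda w: np.array([2 * w[0], 6 * w[1]])

w = np.array([1.0, -2.0])   # arbitrary w_initial
c = 1e-3                    # small learning rate, so the first-order estimate is accurate

w_new = w - c * grad_f(w)                      # one gradient descent step
delta_f = f(w_new) - f(w)                      # actual change in f
predicted = -c * np.dot(grad_f(w), grad_f(w))  # -c * |grad f(w_initial)|^2

print(delta_f, predicted)  # both negative and nearly equal for small c
```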
Saddle points (Optional)
Gradient descent can also take f towards a saddle point. If it lands exactly on a saddle point (which is unlikely), then we are stuck, since \nabla f = 0 there.
If we are slightly off, then there exists some direction \hat{x} such that \dfrac{\partial^2 f(w_{initial})}{\partial \hat{x}^2} < 0. Change the basis to another orthonormal basis that has \hat{x} as one of its vectors; \Delta w then has some component in the \hat{x} direction. If we view f as a function of \hat{x} alone, keeping all the other variables fixed at w_{initial}, it has a local maximum there, so each step pushes us further away from the saddle point and f keeps decreasing. A small numerical illustration is given below.
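A minimal numerical illustration of this escape (my own example, not from the original text): for f(x, y) = x^2 - y^2, the origin is a saddle point, and a start that is slightly off in the y direction is pushed further and further away.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
grad_f = lambda w: np.array([2 * w[0], -2 * w[1]])

w = np.array([1.0, 1e-6])   # slightly off the saddle point in the y direction
c = 0.1

for _ in range(100):
    w = w - c * grad_f(w)   # plain gradient descent step

print(w)  # x has shrunk towards 0, while the tiny y component has grown large
```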
An Example
https://www.youtube.com/watch?v=kJgx2RcJKZY
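As a small worked illustration (my own, independent of the linked video), take f(w) = w^2 with a single trainable input w. Here f'(w) = 2w, so the update rule becomes w_{final} = w_{initial} - c \cdot 2 w_{initial} = (1 - 2c) \, w_{initial}. After k iterations, w = (1 - 2c)^k \, w_{initial}, which shrinks towards the minimum at w = 0 whenever 0 < c < 1, and diverges if c > 1.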
Choosing the Perfect Learning Rate (c)
The learning rate is a crucial hyperparameter for the gradient descent optimization algorithm, as it determines the size of the steps taken at each iteration. A good learning rate should be large enough to converge quickly to the minimum of the cost function, but not so large that it overshoots or oscillates around the minimum. Here are some approaches to find a good learning rate for gradient descent:
- Train the model with different guesses (w) and different learning rates (c), and pick the pair \{c, w\} that gives the lowest value of f(w_{final}) after some fixed, finite number of iterations (see the sketch after this list).
- You can also use known optimizers which are slight variations of gradient descent, like Adam, Adamax, gradient descent with momentum, etc. Some of these optimizers also change the value of c depending on the value of \nabla f(w_{initial}): they keep c small when \nabla f(w_{initial}) is large, so that the method doesn't diverge, and keep c large when \nabla f(w_{initial}) is small, so that it doesn't take too many iterations to reach the local minimum.
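The sketch below illustrates the first approach: a simple grid search over candidate learning rates and initial guesses, keeping the pair whose run ends with the lowest value of f after a fixed iteration budget. The objective function, the candidate values, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative objective and gradient (assumed for this sketch).
f = lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2
grad_f = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

def run(w, c, iters=100):
    """Run a fixed, finite number of gradient descent iterations."""
    for _ in range(iters):
        w = w - c * grad_f(w)
    return w

candidate_rates = [1e-3, 1e-2, 1e-1, 0.5]
candidate_guesses = [np.random.randn(2) for _ in range(3)]

# Keep the {c, w} pair whose run ends with the lowest value of f.
best_c, best_w = min(
    ((c, w0) for c in candidate_rates for w0 in candidate_guesses),
    key=lambda cw: f(run(cw[1], cw[0])),
)
print("best learning rate:", best_c, "best initial guess:", best_w)
```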