Prerequisites
- Gradient and its main properties.
- Vectors as n \times 1 or 1 \times n matrices.
Introduction
Gradient Descent is a very popular and powerful tool, used when minimizing a function exactly (analytically) is not possible.
- It gives a very good approximation to the answer, not the exact answer.
- It is used to find one of the local minima of a known function f.
- This is very useful when we have to minimize some function f that has parameters w_i which can be varied.
- Trainable inputs are the inputs which can be tweaked to reach the minimum.
- Hyperparameters are inputs which are decided by the user and cannot be trained or varied (for example, the learning rate introduced below).
Method
Let f be a function whose trainable inputs are w_i, \; i \in \{1, 2, \dots, n\}, where n \in \mathbb{N}. The gradient descent algorithm goes as follows:
- We define w = [w_1, w_2, \dots, w_n]^T and initialize it with a random guess.
- We update w as w_{final} = w_{initial} - c \, \nabla f(w_{initial}), where c \in \mathbb{R}, c > 0, is a fixed constant chosen by the user. It is also known as the learning rate of the method.
- We keep repeating step 2 until \nabla f(w_{initial}) becomes small enough that \|w_{final} - w_{initial}\| is smaller than the desired accuracy.
- The same effect can also be achieved by rounding w off to one more significant digit than the required accuracy and stopping whenever w_{initial} = w_{final}. A minimal sketch of this loop is given below.
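The following is a minimal sketch of the method in Python, assuming the user supplies a function grad_f that returns \nabla f(w) as a NumPy array; the function names, the example f(w) = \|w\|^2, and the stopping tolerance are illustrative choices, not part of the original description.

```python
import numpy as np

def gradient_descent(grad_f, w_init, c=0.1, tol=1e-8, max_iters=10_000):
    """Minimize f by repeatedly stepping against its gradient.

    grad_f   : callable returning the gradient of f at w (NumPy array)
    w_init   : initial random guess for w
    c        : learning rate (hyperparameter chosen by the user)
    tol      : stop when ||w_final - w_initial|| falls below this accuracy
    max_iters: safety cap, since convergence is not guaranteed in general
    """
    w = np.asarray(w_init, dtype=float)
    for _ in range(max_iters):
        w_new = w - c * grad_f(w)        # w_final = w_initial - c * grad f(w_initial)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Example: f(w) = w_1^2 + w_2^2, so grad f(w) = 2w and the minimum is at the origin.
w_min = gradient_descent(lambda w: 2 * w, w_init=np.random.randn(2), c=0.1)
print(w_min)  # should be very close to [0, 0]
```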
Proof of convergence
Gradient descent is not guaranteed to converge in general, but in the limit c \to 0 and number of iterations \to \infty, it always converges to a local minimum or a saddle point, or diverges to -\infty.
Proof: Let \hat{n} be the unit vector in the direction of \nabla f(w_{initial}), so that \nabla f(w_{initial}) = |\nabla f(w_{initial})| \, \hat{n}.
Let \Delta w = w_{final} - w_{initial} and \Delta f = f(w_{final}) - f(w_{initial}). From the update rule,
\Delta w = -c \, \nabla f(w_{initial}) = -c \, |\nabla f(w_{initial})| \, \hat{n}
We know that df = \nabla f(w_{initial}) \cdot dr, so for a small step,
\Delta f \approx \nabla f(w_{initial}) \cdot \Delta w = |\nabla f(w_{initial})| \, \hat{n} \cdot \big( -c \, |\nabla f(w_{initial})| \, \hat{n} \big) = -c \, |\nabla f(w_{initial})|^2 \le 0 \quad (\text{as } c > 0)
As \Delta f \le 0, f always decreases unless \nabla f(w_{initial}) = 0.
\therefore f reaches a local minimum, a saddle point, or diverges towards -\infty.
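As a quick numerical sanity check of the inequality above (not part of the original proof), the sketch below compares the actual change \Delta f with the first-order prediction -c \, |\nabla f(w)|^2 for a small learning rate; the test function f(w) = w_1^2 + 3 w_2^2 and the chosen c are illustrative assumptions.

```python
import numpy as np

# Test function and its gradient (illustrative choices).
f = lambda w: w[0]**2 + 3 * w[1]**2
grad_f = lambda w: np.array([2 * w[0], 6 * w[1]])

w = np.array([1.0, -2.0])   # arbitrary w_initial
c = 1e-3                    # small learning rate, so the first-order estimate is accurate

w_new = w - c * grad_f(w)                      # one gradient descent step
delta_f = f(w_new) - f(w)                      # actual change in f
predicted = -c * np.dot(grad_f(w), grad_f(w))  # -c * |grad f(w_initial)|^2

print(delta_f, predicted)  # both negative and nearly equal for small c
```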
Saddle points (Optional)
Gradient descent can also take f towards a saddle point. If it lands exactly on a saddle point (which is unlikely), then we are stuck, since \nabla f = 0 there.
If we are slightly off, then there exists some direction \hat{x} such that \dfrac{\partial^2 f(w_{initial})}{\partial \hat{x}^2} < 0. Change the basis to another orthonormal basis that has \hat{x} as one of its vectors; \Delta w then has some component in the \hat{x} direction. If we view f as a function of \hat{x} alone, keeping all the other variables fixed at w_{initial}, it has a local maximum there, so each step pushes us further away from the saddle point and f keeps decreasing. A small numerical illustration is given below.
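A minimal numerical illustration of this escape (my own example, not from the original text): for f(x, y) = x^2 - y^2, the origin is a saddle point, and a start that is slightly off in the y direction is pushed further and further away.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at the origin.
grad_f = lambda w: np.array([2 * w[0], -2 * w[1]])

w = np.array([1.0, 1e-6])   # slightly off the saddle point in the y direction
c = 0.1

for _ in range(100):
    w = w - c * grad_f(w)   # plain gradient descent step

print(w)  # x has shrunk towards 0, while the tiny y component has grown large
```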
An Example
https://www.youtube.com/watch?v=kJgx2RcJKZY
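As a small worked illustration (my own, independent of the linked video), take f(w) = w^2 with a single trainable input w. Here f'(w) = 2w, so the update rule becomes w_{final} = w_{initial} - c \cdot 2 w_{initial} = (1 - 2c) \, w_{initial}. After k iterations, w = (1 - 2c)^k \, w_{initial}, which shrinks towards the minimum at w = 0 whenever 0 < c < 1, and diverges if c > 1.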
Choosing the Perfect Learning Rate (c)
The learning rate is a crucial hyperparameter for the gradient descent optimization algorithm, as it determines the size of the steps taken at each iteration. A good learning rate should be large enough to converge quickly to the minimum of the cost function, but not so large that it overshoots or oscillates around the minimum. Here are some approaches to find a good learning rate for gradient descent:
- Train the model with different guesses (w) and different learning rates (c), and pick the pair \{c, w\} that gives the lowest value of f(w_{final}) after some fixed, finite number of iterations (see the sketch after this list).
- You can also use known optimizers which are slight variations of gradient descent, like Adam, Adamax, gradient descent with momentum, etc. Some of these optimizers also change the value of c depending on the value of \nabla f(w_{initial}): they keep c small when \nabla f(w_{initial}) is large, so that the method doesn't diverge, and keep c large when \nabla f(w_{initial}) is small, so that it doesn't take too many iterations to reach the local minimum.
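The sketch below illustrates the first approach: a simple grid search over candidate learning rates and initial guesses, keeping the pair whose run ends with the lowest value of f after a fixed iteration budget. The objective function, the candidate values, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative objective and gradient (assumed for this sketch).
f = lambda w: (w[0] - 3) ** 2 + (w[1] + 1) ** 2
grad_f = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

def run(w, c, iters=100):
    """Run a fixed, finite number of gradient descent iterations."""
    for _ in range(iters):
        w = w - c * grad_f(w)
    return w

candidate_rates = [1e-3, 1e-2, 1e-1, 0.5]
candidate_guesses = [np.random.randn(2) for _ in range(3)]

# Keep the {c, w} pair whose run ends with the lowest value of f.
best_c, best_w = min(
    ((c, w0) for c in candidate_rates for w0 in candidate_guesses),
    key=lambda cw: f(run(cw[1], cw[0])),
)
print("best learning rate:", best_c, "best initial guess:", best_w)
```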