June 17th, 2021, 10:51 am

The more I investigate Gradient Descent (GD), the more I believe the ML crowd has opened a can of worms (in their naivety). Invoking Sapir-Whorf: when your vocabulary begins with linear algebra and iteration, you end up staring at the solution procedure instead of the underlying problem. Because algebra is the low-hanging fruit.

Let's skip the zillion fixes and patches for GD; there are enough blog posts already. One example, however:

*The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. ... Momentum can accelerate training and learning rate schedules can help to converge the optimization process.*

There is much ado about this, and a whole cottage industry has grown up around it, e.g. grid search over learning rates (ugh).
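To make the quoted fixes concrete, here is a minimal sketch of classical (heavy-ball) momentum plus a simple 1/t decay schedule, on an assumed quadratic objective f(x) = 0.5 xᵀAx; the names `lr0` and `mu` and the schedule constants are illustrative, not from the post.

```python
import numpy as np

# Assumed toy objective: f(x) = 0.5 x^T A x, minimizer at the origin.
A = np.diag([1.0, 10.0])

def grad_f(x):
    return A @ x

lr0, mu = 0.05, 0.9            # base learning rate, momentum coefficient
x = np.array([1.0, 1.0])
v = np.zeros_like(x)

for t in range(200):
    lr = lr0 / (1.0 + 0.01 * t)    # decaying learning-rate schedule
    v = mu * v - lr * grad_f(x)    # classical momentum (heavy-ball) velocity
    x = x + v

print(np.linalg.norm(x))           # distance to the minimizer at the origin
```

Both knobs (`mu`, the schedule constants) are themselves hyperparameters, which is exactly the cottage industry being complained about.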

GD is really a finite-difference (FD) scheme, namely forward Euler, for a dissipative gradient-flow ODE dx/dt = -∇f(x) (Lagrange, Poincaré); the learning rate is the step size of the ODE solver. In the ODE setting, equality and inequality constraints are easy to handle (try that with GD..)
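The identification is exact: a GD step with learning rate h is one forward-Euler step of size h applied to dx/dt = -∇f(x). A quick check on an assumed quadratic f(x) = 0.5 xᵀAx (the objective and constants are illustrative):

```python
import numpy as np

# Assumed toy objective: f(x) = 0.5 x^T A x, so grad f(x) = A x.
A = np.diag([1.0, 10.0])

def grad_f(x):
    return A @ x

h = 0.05                        # learning rate == ODE step size
x_gd = np.array([1.0, 1.0])     # gradient-descent iterate
x_eu = np.array([1.0, 1.0])     # forward-Euler iterate

for _ in range(200):
    x_gd = x_gd - h * grad_f(x_gd)       # GD update
    x_eu = x_eu + h * (-grad_f(x_eu))    # forward Euler on dx/dt = -grad f

print(np.max(np.abs(x_gd - x_eu)))       # identical by construction
```

Step-size restrictions of explicit solvers then explain the familiar learning-rate folklore: too large and the iteration diverges, which is just the stability limit of forward Euler.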