ISayMoo wrote:AFAIK the standard practice is to evaluate the full gradient and use it, but recently there was a number of papers discussing Block Coordinate Descent algorithms. But they all take into account the structure of the network. Yours doesn't.

This family of algorithms could be interesting indeed. They are many years old and well-known in the numerical analysis community (see e.g. Ortega and Rheinboldt etc.) Basically, an n-d problem can be solved as a sequence of simpler 1-d (or blocked) problems (ADI, ADE, Gauss-Seidel etc.). So it has a great future behind it.

It is easy to program and can be parallelised (the article by Stephen Wright is nice). The idea of computing the full gradient is becoming less palatable (even the current simple code attests to this) than computing simpler partial derivatives which may even have analytical solutions.

IMO it could be useful. I suppose anything is better than pesky learning rates?