This (benign) example is a bit like the economist on the train to Scotland who looks out of the window and sees a black sheep. Thus all sheep are black? Not so. That was a single sheep with one side black.

*GD converges on non-differentiable functions as long as they're convex. E.g. it will handle f(x) = |x| as well as f(x) = x^2.*
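For what it's worth, the f(x) = |x| versus f(x) = x^2 claim is easy to try out. The snippet below is only a rough sketch under my own assumptions: a fixed step of 0.1, a start at 1.05, and sign(x) used as the "gradient" of |x| (with 0 chosen at the kink). The helper names are made up for illustration. With these choices the x^2 run contracts towards 0, while the |x| run ends up hopping back and forth across 0 at a distance set by the step size, so any convergence claim for |x| presumably needs a shrinking step.

```python
def sign(v):
    # Returns -1, 0 or 1; 0 at the kink is one valid subgradient of |x|.
    return (v > 0) - (v < 0)

def gradient_descent(grad, x0, lr=0.1, n_iter=1000):
    # Plain fixed-step iteration x <- x - lr * grad(x).
    x = x0
    for _ in range(n_iter):
        x = x - lr * grad(x)
    return x

# f(x) = x^2, gradient 2x: each step multiplies x by (1 - 2*lr) = 0.8.
x_smooth = gradient_descent(lambda x: 2.0 * x, 1.05)

# f(x) = |x|, "gradient" sign(x): the step never shrinks near the minimum.
x_kink = gradient_descent(sign, 1.05)

print("x^2 final iterate:", x_smooth)
print("|x| final iterate:", x_kink)
```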

I know GD, but I have no hands-on experience with it. I have some experience applying the EM algorithm to various Markov models. I can relate to some of the points you mentioned:

- Needs an initial guess close to the real solution (Analyse Numerique 101).
- No guarantee that GD is applicable in the first place (it assumes the cost function is smooth).
- May converge only to a local minimum.
- The method is iterative, so there is no truly reliable quality of service (QoS).
- It's not very robust.

**GD for discontinuous functions is ill-defined,** e.g. for the Heaviside function. In fact, I would be surprised if your example works. It could be serendipity (you got lucky).
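To make the Heaviside point concrete, here is a small sketch. Everything in it is an assumption of mine for illustration: the convention H(0) = 1, a central finite difference with h = 1e-6 standing in for the gradient, a learning rate of 0.1, and the hypothetical helper names. Away from the jump the estimated gradient is exactly zero, so GD does not move at all; within h of the jump the estimate is of order 1/(2h), so a single step throws the iterate far away. Neither behaviour says anything useful about where the minimum is.

```python
def heaviside(x):
    # Step function with the (assumed) convention H(0) = 1.
    return 1.0 if x >= 0.0 else 0.0

def central_diff(f, x, h=1e-6):
    # Finite-difference stand-in for the derivative of f at x.
    return (f(x + h) - f(x - h)) / (2.0 * h)

lr = 0.1

# Far from the jump: the function is flat, so the gradient estimate is 0.
g_far = central_diff(heaviside, 1.0)
x_far_next = 1.0 - lr * g_far

# Within h of the jump: the estimate is roughly 1 / (2h), an artefact of h.
g_near = central_diff(heaviside, 1e-7)
x_near_next = 1e-7 - lr * g_near

print("far from jump:  grad =", g_far, " next x =", x_far_next)
print("near the jump:  grad =", g_near, " next x =", x_near_next)
```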

Nocedal & Wright briefly discuss non-smooth problems and sub-gradients.

I am assuming you implicitly agree with my other bullet points.

Have we forgotten exploding gradients??

https://en.wikipedia.org/wiki/Vanishing ... nt_problem
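A toy illustration of why gradients explode (or vanish) with depth, under my own simplifying assumptions: a scalar linear recurrence h_t = w * h_{t-1} unrolled for 50 steps, so that by the chain rule the derivative of h_50 with respect to h_0 is just w multiplied by itself 50 times. The function name and the values of w are made up for the example; real networks have matrices and nonlinearities, but the multiplicative mechanism is the same idea.

```python
def backprop_factor(w, depth):
    # Chain rule through h_t = w * h_{t-1}: d h_depth / d h_0 = w ** depth,
    # accumulated one layer at a time as backpropagation would do it.
    grad = 1.0
    for _ in range(depth):
        grad *= w
    return grad

exploding = backprop_factor(1.5, 50)   # |w| > 1: grows geometrically
vanishing = backprop_factor(0.5, 50)   # |w| < 1: shrinks geometrically

print("w = 1.5, depth 50:", exploding)
print("w = 0.5, depth 50:", vanishing)
```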

*For non-differentiable functions, gradient methods are ill-defined. For locally Lipschitz problems and especially for convex minimization problems, bundle methods of descent are well-defined. Non-descent methods, like subgradient projection methods, may also be used. These methods are typically slower than gradient descent.*
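As a rough sketch of what a non-descent subgradient method looks like in practice, here is a basic subgradient iteration (no projection, since the toy problem is unconstrained). The test function f(x, y) = |x| + 2|y|, the step rule c/k with c = 0.5, the starting point and the helper names are all my own assumptions for illustration. Because the objective can go up from one iteration to the next, the loop keeps track of the best value seen so far rather than trusting the last iterate.

```python
def sign(v):
    # Returns -1, 0 or 1; 0 at a kink lies in the subdifferential of |v|.
    return (v > 0) - (v < 0)

def f(x, y):
    # Convex, non-differentiable along both axes.
    return abs(x) + 2.0 * abs(y)

def subgrad(x, y):
    # One valid subgradient of f at (x, y).
    return sign(x), 2.0 * sign(y)

def subgradient_method(x, y, n_iter=2000, c=0.5):
    best = f(x, y)
    prev = best
    increases = 0  # how often the objective went UP between iterations
    for k in range(1, n_iter + 1):
        gx, gy = subgrad(x, y)
        step = c / k  # diminishing, non-summable step sizes
        x, y = x - step * gx, y - step * gy
        val = f(x, y)
        if val > prev:
            increases += 1
        best = min(best, val)
        prev = val
    return best, increases

best_f, increases = subgradient_method(1.05, -0.73)
print("best objective value seen:", best_f)
print("iterations where f increased:", increases)
```

The count of increases is the point: it is not a descent method, yet the best value still heads towards the minimum of 0 as the steps shrink.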