Do you believe ODE/PDE are required for understanding DL, or might it be just one lens among many?

I think it's very important, what Cuch is working on; nobody has ever really thought about the basic math.

I am trying first of all to learn DL (it is very interesting), so I try to connect up to the matter from my own background. It might come across as awkward, but I want to flag each assumption.

Fair enough. There is of course *lots* of material you can read about any narrow aspect of the field.

IMO you're focusing too much on loss minimization without picking up on the fact that the goal is never to find the absolute minimum of the loss: you ignore the statistical side.

The field is called *statistical* machine learning because you want to learn models from finite data sets. You have a finite set of high-dimensional {x, y} pairs with an unknown relation between them. The goal is to find a function y_ = f(x, theta) that best predicts y given x. You have to pick a functional form for f and then find theta. f could be a line and theta the slope + intercept without changing the argument. If you use DL then the functional form is a NN with multiple layers; you have to pick the topology, activation functions, etc.

For finding theta you can use all sorts of methods, including trying out random theta values and seeing which gives the best result. That works really well if your loss surface is very rough. In DL people try to find good activation functions that shape the loss function to make it work nicely with GD. It's a computational criterion: smooth is good, fast gradient computation is good. Things are pulled together intelligently: if you aren't allowed to use GD (or can't) then you pick a different type of f or loss function and make it perform well under those constraints. E.g. for some functions you can't specify a loss function.
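To make that concrete, here is a minimal sketch of both approaches on the line example above (f(x, theta) = theta[0]*x + theta[1]): random search, which only needs to *evaluate* the loss, and gradient descent, which exploits its smoothness. The data and all constants are my own toy choices, not from the discussion.

```python
import numpy as np

# Toy finite sample: a hidden linear relation y = 2x + 0.5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 50)

def loss(theta):
    """Mean squared error of the linear model y_ = theta[0]*x + theta[1]."""
    return np.mean((theta[0] * x + theta[1] - y) ** 2)

# 1) Random search: try random theta values, keep the best.
#    Only needs loss *values*, so it works even on a rough surface.
best_theta, best_loss = None, np.inf
for _ in range(2000):
    cand = rng.uniform(-5, 5, 2)
    l = loss(cand)
    if l < best_loss:
        best_theta, best_loss = cand, l

# 2) Gradient descent: follow the analytic gradient of the smooth MSE loss.
theta = np.zeros(2)
for _ in range(500):
    err = theta[0] * x + theta[1] - y
    grad = np.array([np.mean(2 * err * x), np.mean(2 * err)])
    theta -= 0.5 * grad

print("random search:", best_theta, "GD:", theta)
```

Both recover roughly (2.0, 0.5); the difference is that GD gets there cheaply *because* this loss is smooth, which is exactly the property DL design choices try to preserve.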

The point where "statistical" comes in is that you have a finite set of samples. A large network would be able to memorize all samples, which would give zero loss, the ultimate lowest point in the loss surface! But you don't want to memorize: you want to learn the general structure in the data that links x to y and use that to find good y' values when presented with *new* {x'} samples. When training a NN you always monitor generalisation vs. memorization; the stop condition depends mostly on that and not much on properties of the loss itself. This is also one of the reasons people use stochastic gradient descent instead of full gradient descent (this has analogies to bootstrap aggregation): by changing the subset of {x, y} pairs in every gradient step you change the loss surface in every step. This keeps the NN from fixating on "the one true loss function" it wants to memorize; it conveys the fact that you have uncertainty in the loss surface due to sample noise (a finite set of training samples). The other reason is computational constraints: you have GPUs with limited memory and a large dataset you can't load into GPU memory, so you process the data in small batches.
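A minimal sketch of that training-loop shape, under toy assumptions of my own (a linear model again, made-up data and hyperparameters): each minibatch step descends a slightly different loss surface, and the stop condition watches a held-out validation set rather than the training loss.

```python
import numpy as np

# Toy finite sample with a hidden relation y = 3x - 1 plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x - 1.0 + rng.normal(0, 0.2, 200)

# Hold out part of the finite sample to monitor generalisation.
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

w, b, lr, batch = 0.0, 0.0, 0.1, 16
best_val, patience, bad_epochs = np.inf, 5, 0

for epoch in range(200):
    order = rng.permutation(len(x_train))
    for i in range(0, len(order), batch):
        idx = order[i:i + batch]
        # Each minibatch defines a slightly different loss surface:
        # this gradient is of *this subset's* MSE, not the full loss.
        err = w * x_train[idx] + b - y_train[idx]
        w -= lr * np.mean(2 * err * x_train[idx])
        b -= lr * np.mean(2 * err)
    val_loss = np.mean((w * x_val + b - y_val) ** 2)
    # Early stopping: stop when *validation* loss stops improving,
    # regardless of how much further the training loss could drop.
    if val_loss < best_val - 1e-6:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```

In practice you'd also checkpoint the weights at the best validation loss and restore them; the sketch omits that to keep the loop shape visible.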

For me the most interesting research area at the moment is the information-theoretic angle on learning. When you train a deep NN it needs to do two things: learn to extract features from x, and then use those features to predict y. A recent paper triggered lots of discussion:

https://arxiv.org/abs/1703.00810 and I think that innovation wrt training will come from it.

ps: for GD the biggest issue is saddle points, not local minima. In high dimensions you'll find far fewer locations where the curvature has the same sign in every direction (true minima) than places where the curvatures have mixed signs (saddle points). All sorts of classical tricks are applied to try and tackle that: adding noise, adding momentum, ...
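Here is a small illustration of the noise trick on a 2-D toy loss of my own choosing, f(x, y) = x^2 + y^4/4 - y^2/2, which has a saddle at the origin and true minima at y = ±1. Plain GD started exactly on the ridge y = 0 stalls at the saddle; a tiny amount of gradient noise kicks it off the ridge.

```python
import numpy as np

# Gradient of the toy loss f(x, y) = x^2 + y^4/4 - y^2/2:
# saddle point at (0, 0), minima at (0, -1) and (0, +1).
def grad(p):
    x, y = p
    return np.array([2 * x, y ** 3 - y])

rng = np.random.default_rng(2)
lr = 0.1

# Plain GD started on the ridge y = 0: the y-gradient is exactly zero
# there, so y never moves and the iterate converges to the saddle.
p = np.array([1.0, 0.0])
for _ in range(500):
    p = p - lr * grad(p)
plain_gd = p.copy()

# Same start, but with a little gradient noise: the perturbation pushes
# y off the ridge, and the negative curvature then amplifies it until
# the iterate escapes towards one of the real minima at y = +/-1.
p = np.array([1.0, 0.0])
for _ in range(500):
    p = p - lr * (grad(p) + rng.normal(0, 0.01, 2))
noisy_gd = p.copy()

print("plain GD:", plain_gd, "noisy GD:", noisy_gd)
```

Momentum plays a similar role: accumulated velocity carries the iterate through flat, slightly unstable regions instead of letting it settle there.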