SERVING THE QUANTITATIVE FINANCE COMMUNITY

Cuchulainn
Posts: 61155
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: DL and PDEs

Are ODEs any good for solving simultaneous equations? (It may be a stupid question. I don't know the answer to that because I've never looked.)

This question has been posed and solved on my famous and extremely popular $e^5$ thread. It is one of the 19 solutions. Have you forgotten? I posted a link already.
(It's a wee bit silly, that question: not because it is stupid (which it is not), but because it is more of a lightning deflector).

Now that you mention it ...

More generally, you can embed a nonlinear system f(x) = 0 in a homotopy/continuation

H(x,t) = (1-t)(f(x) - f(x_0)) + t f(x),  t in [0,1].

Differentiate H(x(t),t) = 0 with respect to t to get an ODE for x(t). Woopsie-daisy. A very robust way to solve nonlinear equations.

So, the answer is yes. It is even possible to perform optimisation using homotopy.
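Here is a minimal numerical sketch of that idea (the toy system and step counts are my own, not from the thread): trace the Newton homotopy H(x,t) = f(x) - (1-t) f(x_0) from t = 0 to t = 1 with an Euler predictor along the resulting ODE plus one Newton corrector per step.

```python
import numpy as np

def f(x):
    # Hypothetical test system: a circle and a line.
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     x[0] - x[1] - 1.0])

def jac(x):
    # Analytic Jacobian of f.
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0,        -1.0]])

def homotopy_solve(f, jac, x0, steps=100):
    """Trace H(x,t) = f(x) - (1-t) f(x0) from t=0 (root x0) to t=1 (root of f).
    Differentiating H(x(t), t) = 0 w.r.t. t gives J(x) dx/dt = -f(x0)."""
    x = np.asarray(x0, dtype=float)
    fx0 = f(x)
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        # Euler predictor along the continuation ODE.
        x = x + dt * np.linalg.solve(jac(x), -fx0)
        t += dt
        # One Newton corrector step on H(x, t) = 0.
        x = x - np.linalg.solve(jac(x), f(x) - (1.0 - t) * fx0)
    return x

root = homotopy_solve(f, jac, x0=[2.0, 0.5])
# root should satisfy f(root) ≈ 0, i.e. root ≈ (1.8229, 0.8229)
```

The final corrector step at t = 1 is a plain Newton step on f itself, so the tracked root lands on the solution to high accuracy.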
Last edited by Cuchulainn on November 21st, 2017, 5:07 pm, edited 5 times in total.
http://www.datasimfinancial.com
http://www.datasim.nl

Every Time We Teach a Child Something, We Keep Him from Inventing It Himself
Jean Piaget

Cuchulainn
Posts: 61155
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: DL and PDEs

Are ODEs any good for DL? I.e., can we model DL by ODEs for the weights? I don't think it's rocket science to say yes/no.
https://arxiv.org/pdf/1703.02009.pdf

I mean equations (1)-(4). Why?
http://www.datasimfinancial.com
http://www.datasim.nl

Every Time We Teach a Child Something, We Keep Him from Inventing It Himself
Jean Piaget

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: DL and PDEs

Another cool idea is to try flipping the sign of the loss function and using a maximizer instead of a minimizer!

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: DL and PDEs

Are ODEs any good for solving simultaneous equations? (It may be a stupid question. I don't know the answer to that because I've never looked.)

This question has been posed and solved on my famous and extremely popular $e^5$ thread. It is one of the 19 solutions. Have you forgotten? I posted a link already.
(It's a wee bit silly, that question: not because it is stupid (which it is not), but because it is more of a lightning deflector).

Now that you mention it ...

More generally, you can embed a nonlinear system f(x) = 0 in a homotopy/continuation

H(x,t) = (1-t)(f(x) - f(x_0)) + t f(x),  t in [0,1].

Differentiate H(x(t),t) = 0 with respect to t to get an ODE for x(t). Woopsie-daisy. A very robust way to solve nonlinear equations.

So, the answer is yes. It is even possible to perform optimisation using homotopy.
Very interesting.

But does it translate to simultaneous equations involving minimization rather than equality?

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: DL and PDEs

Another cool idea is to try flipping the sign of the loss function and using a maximizer instead of a minimizer!
In theory, one can map the loss function through any monotonic function, which would modulate all the slopes in the function as well as the relative heights/depths of local minima, saddles, etc., without changing the rank order of those surface features.
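A quick numerical sanity check of that claim, on a toy loss of my own choosing: pushing a strictly positive loss through log rescales every slope but leaves the location of the minima and the rank order of all values intact.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)
# Toy multimodal, strictly positive loss (hypothetical example).
loss = (x - 1.0)**2 * (x + 2.0)**2 + 0.1
mapped = np.log(loss)  # any strictly increasing map would do

# Same argmin location...
assert np.argmin(loss) == np.argmin(mapped)
# ...and sorting by the original loss also sorts the mapped loss.
assert np.all(np.diff(mapped[np.argsort(loss)]) >= 0)
```

This is exactly why minimizing a negative log-likelihood is equivalent to maximizing the likelihood itself.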

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: DL and PDEs

Another cool idea is to try flipping the sign of the loss function and using a maximizer instead of a minimizer!
In theory, one can map the loss function through any monotonic function, which would modulate all the slopes in the function as well as the relative heights/depths of local minima, saddles, etc., without changing the rank order of those surface features.
Yes, that's why *log* likelihood optimization is equivalent.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: DL and PDEs

For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017
SELU is already in TensorFlow 1.4

Cuchulainn
Posts: 61155
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: DL and PDEs

For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017
SELU is already in TensorFlow 1.4
Wow. More is less? Don't see any ODE, DL/PDE in there.
http://www.datasimfinancial.com
http://www.datasim.nl

Every Time We Teach a Child Something, We Keep Him from Inventing It Himself
Jean Piaget

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: DL and PDEs

For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017
SELU is already in TensorFlow 1.4
Wow. More is less? Don't see any ODE, DL/PDE in there.
Don't let that stop you; go for it!

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: DL and PDEs

For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017
SELU is already in TensorFlow 1.4
Wow. More is less? Don't see any ODE, DL/PDE in there.
Do you believe ODE/PDE are required for understanding DL, or might they be just one lens among many?

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: DL and PDEs

For those who are also getting a bit bored with all this craziness, the NIPS 2017 submissions are in: https://papers.nips.cc/book/advances-in ... ms-30-2017
SELU is already in TensorFlow 1.4
Wow. More is less? Don't see any ODE, DL/PDE in there.
Do you believe ODE/PDE are required for understanding DL, or might they be just one lens among many?
I think it's very important what Cuch is working on; nobody has ever thought about the basic math
Last edited by outrun on November 22nd, 2017, 1:34 am, edited 1 time in total.

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: DL and PDEs

Wow. More is less? Don't see any ODE, DL/PDE in there.
Do you believe ODE/PDE are required for understanding DL, or might they be just one lens among many?
I think it's very important what Cuch is working on; nobody has ever thought about the basic math
I agree that his goals are very laudable and I do hope his methods bear fruit. Alas, I fear they will either lead nowhere (the proof he seeks does not exist), or they will require simplified DL algorithms that have been superseded by the current successful ones, or restrictions on the underlying learned system that are not provable from the data and merely displace the robustness question to that layer.

But I sincerely hope he proves me wrong!

Cuchulainn
Posts: 61155
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: DL and PDEs

Are ODEs any good for solving simultaneous equations? (It may be a stupid question. I don't know the answer to that because I've never looked.)

This question has been posed and solved on my famous and extremely popular $e^5$ thread. It is one of the 19 solutions. Have you forgotten? I posted a link already.
(It's a wee bit silly, that question: not because it is stupid (which it is not), but because it is more of a lightning deflector).

Now that you mention it ...

More generally, you can embed a nonlinear system f(x) = 0 in a homotopy/continuation

H(x,t) = (1-t)(f(x) - f(x_0)) + t f(x),  t in [0,1].

Differentiate H(x(t),t) = 0 with respect to t to get an ODE for x(t). Woopsie-daisy. A very robust way to solve nonlinear equations.

So, the answer is yes. It is even possible to perform optimisation using homotopy.
Very interesting.

But does it translate to simultaneous equations involving minimization rather than equality?
For linear systems this homotopy could be used, but it is overkill, I suppose. For overdetermined or ill-posed systems, least squares is good. Use DE to avoid gradients?
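For the overdetermined case, a least-squares sketch with made-up data: three equations, two unknowns, solved with `np.linalg.lstsq`. (If "DE" means differential evolution, `scipy.optimize.differential_evolution` would be the gradient-free alternative.)

```python
import numpy as np

# Overdetermined system A x = b: fit intercept + slope to 3 points (toy data).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
# x minimizes ||A x - b||_2; here x ≈ [2/3, 1/2].
```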
http://www.datasimfinancial.com
http://www.datasim.nl

Every Time We Teach a Child Something, We Keep Him from Inventing It Himself
Jean Piaget

Cuchulainn
Posts: 61155
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: DL and PDEs

Wow. More is less? Don't see any ODE, DL/PDE in there.
Do you believe ODE/PDE are required for understanding DL, or might they be just one lens among many?
I think it's very important what Cuch is working on; nobody has ever thought about the basic math
I am trying first of all to learn DL (it is very interesting), so I try to connect to the subject from my own background. It might come across as awkward, but I wish to flag each assumption.
http://www.datasimfinancial.com
http://www.datasim.nl

Every Time We Teach a Child Something, We Keep Him from Inventing It Himself
Jean Piaget

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: DL and PDEs

Do you believe ODE/PDE are required for understanding DL, or might they be just one lens among many?
I think it's very important what Cuch is working on; nobody has ever thought about the basic math
I am trying first of all to learn DL (it is very interesting), so I try to connect to the subject from my own background. It might come across as awkward, but I wish to flag each assumption.
Fair enough. There is of course *lots* of material you can read about any narrow aspect of the field.

IMO you're focusing too much on loss minimization without picking up on the fact that the goal is never to find the absolute minimum of the loss; you ignore the statistical side.

The field is called *statistical* machine learning because you want to learn models from finite data sets. You have a finite set of high-dimensional {x,y} pairs with an unknown relation between them. The goal is to find a function y_ = f(x,theta) that predicts y best given x. You have to pick a functional form for f and then find theta. f could be a line and theta the slope + intercept without changing the argument. If you use DL then the functional form is a NN with multiple layers; you have to pick the topology, activation functions etc. For finding theta you can use all sorts of methods, including trying out random theta values and seeing which gives the best result. That works really well if your surface is very rough. In DL people try to find good activation functions that shape the loss function to make it work nicely with GD. It's a computational metric: smooth is good, fast gradient computation is good. Things are pulled together intelligently: if you aren't allowed to use GD (or can't) then you pick a different type of f or loss function and make it perform well under those constraints. E.g. for some functions you can't specify a loss function.
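The line example above, made concrete with synthetic data of my own: f(x, theta) = theta_0 + theta_1 x, with theta found by full-batch gradient descent on the mean-squared loss.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=200)
y = 0.5 + 2.0 * x + 0.05 * rng.standard_normal(200)  # true theta = (0.5, 2.0)

theta = np.zeros(2)  # (intercept, slope)
lr = 0.1
for _ in range(500):
    err = theta[0] + theta[1] * x - y
    # Gradient of the loss mean(err^2)/2 w.r.t. (intercept, slope).
    grad = np.array([err.mean(), (err * x).mean()])
    theta -= lr * grad
# theta ends up near (0.5, 2.0), up to the noise level
```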

The point where "statistical" comes in is that you have a finite set of samples. A large network would be able to memorize all samples, which would give zero loss, the ultimate lowest point in the loss surface! But you don't want to memorize; you want to learn the general structure in the data that links x to y and use that to find good y' values when presented with *new* {x'} samples. When training a NN you always monitor generalisation vs memorizing. The stopping condition depends mostly on that and not much on the loss property. This is also one of the reasons people use stochastic gradient descent instead of full gradient descent (this has analogies to bootstrap aggregation): by changing the subset of {x,y} pairs in every gradient step you change the loss surface in every step. This helps the NN not fixate on "the one and true loss function that it wants to memorize"; it helps convey the fact that you have uncertainty in the loss surface due to sample noise (a finite set of training samples). The other reason is computational constraints: you have GPUs with limited memory and a large dataset you can't load into GPU memory, so you process data in small batches.
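A sketch of the "new loss surface every step" point, again on made-up data: each iteration draws a fresh minibatch, so the gradient is taken on a slightly different surface each time, yet theta still settles near the generating parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)
y = 0.5 + 2.0 * x + 0.1 * rng.standard_normal(1000)  # true theta = (0.5, 2.0)

theta = np.zeros(2)  # (intercept, slope)
lr, batch = 0.05, 32
for _ in range(2000):
    idx = rng.integers(0, len(x), size=batch)  # fresh minibatch => new loss surface
    xb, yb = x[idx], y[idx]
    err = theta[0] + theta[1] * xb - yb
    grad = np.array([err.mean(), (err * xb).mean()])
    theta -= lr * grad
# theta hovers around (0.5, 2.0); the minibatch noise never lets it settle exactly
```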
For me the most interesting research area at the moment is the information-theoretic angle on learning. When you train a deep NN it needs to do two things: learn to extract features from x and then use those features to predict y. A recent paper triggered lots of discussion: https://arxiv.org/abs/1703.00810 and I think that innovation w.r.t. training will come from it.

ps: for GD the biggest issue is saddle points, not local minima. In high dimensions you'll find far fewer locations where the curvature has the same sign in every direction (true minima) than locations where the signs are mixed (saddle points). All sorts of classical tricks are applied to try to tackle that: adding noise, adding momentum, ...
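A tiny illustration of the saddle-point issue, using the textbook saddle f(x, y) = x^2 - y^2 (my own toy setup): plain GD started exactly on the x-axis walks straight into the saddle at the origin, while a little injected noise lets the unstable y-direction escape.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])  # gradient of f(x, y) = x^2 - y^2

def descend(p0, lr=0.1, steps=100, noise=0.0, seed=1):
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p) + noise * rng.standard_normal(2)
    return p

stuck = descend([1.0, 0.0])               # y stays exactly 0: converges to the saddle
freed = descend([1.0, 0.0], noise=1e-3)   # noise kicks it off the unstable axis
```

On the x-axis the y-gradient is identically zero, so deterministic GD can never leave it; any perturbation in y is amplified by the negative curvature, which is the mechanism noise injection exploits.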