SERVING THE QUANTITATIVE FINANCE COMMUNITY

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

GD > SD

Edit: GD has many variants, SD is one specific version,
The literature contradicts all this. There are lots of ways to choose $\lambda$ in SD. See Wikipedia.

In NN people almost always use SGD.
SGD is an extension/optimisation of GD.
More a "type of".
in particular SGD != SD, .. by definition!
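
To make the distinction argued over here concrete, below is a minimal sketch under one particular reading of the terms, which is an assumption and not something this thread settles: GD as a fixed-step update, SD as the same update with $\lambda$ chosen by exact line search on a quadratic, and SGD as GD driven by a random mini-batch gradient. The function names, the test problems and the step sizes are all illustrative choices.

```python
import numpy as np

def gd_fixed(A, b, x0, lam, steps):
    """Fixed-step gradient descent on f(x) = 0.5 x^T A x - b^T x."""
    x = x0.copy()
    for _ in range(steps):
        g = A @ x - b              # gradient of the quadratic
        x = x - lam * g
    return x

def sd_exact_line_search(A, b, x0, steps):
    """Same direction, but lambda is re-chosen each step by exact line search."""
    x = x0.copy()
    for _ in range(steps):
        g = A @ x - b
        gg = g @ g
        if gg < 1e-30:             # already converged, avoid 0/0
            break
        lam = gg / (g @ (A @ g))   # minimises f(x - lam * g) over lam
        x = x - lam * g
    return x

def sgd_least_squares(X, y, w0, lam, steps, batch, rng):
    """Mini-batch SGD on the mean squared error of a linear model."""
    w = w0.copy()
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch   # noisy gradient estimate
        w = w - lam * g
    return w

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
x_gd = gd_fixed(A, b, np.zeros(2), lam=0.2, steps=200)
x_sd = sd_exact_line_search(A, b, np.zeros(2), steps=50)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true                           # noiseless targets, for simplicity
w_sgd = sgd_least_squares(X, y, np.zeros(2), lam=0.1, steps=2000, batch=32, rng=rng)
```

Under this reading the first two differ only in how $\lambda$ is chosen, while SGD changes the gradient itself, not the step-size rule.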

Cuchulainn
Posts: 62936
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

GD > SD

Edit: GD has many variants, SD is one specific version,
The literature contradicts all this. There are lots of ways to choose $\lambda$ in SD. See Wikipedia.

In NN people almost always use SGD.
SGD is an extension/optimisation of GD.
More a "type of".
in particular SGD != SD, .. by definition!
True. But that was not the question.
Step over the gap, not into it. Watch the space between platform and train.
http://www.datasimfinancial.com
http://www.datasim.nl

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

The literature contradicts all this. There are lots of ways to choose $\lambda$ in SD. See Wikipedia.

In NN people almost always use SGD.
SGD is an extension/optimisation of GD.
More a "type of".
in particular SGD != SD, .. by definition!
True. But that was not the question.
But that's what you claimed a couple of times already in posts, and in a tone as if you were knowledgeable about all this!

Comments like "it's simple: GD==SD" or things along the lines of "SD is what people in AI call GD, I hate their abuse of terminology". It looks like you think you can only exist in negative space?

ExSan
Posts: 4583
Joined: April 12th, 2003, 10:40 am

### Re: exp(5) = $e^5$

Cuchulainn
Posts: 62936
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

"velore"?? "valore" (value)?
Is 'K' fixed? I am fixated on $e^5$.

ISayMoo
Posts: 2366
Joined: September 30th, 2015, 8:30 pm

### Re: exp(5) = $e^5$

But don't draw general conclusions from this plot. It's a 2D projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the topology of the network and the activation function you choose. Those are the relevant things to focus on.
I remembered that this often expressed opinion was also repeated on this forum... short story: it's false. For every flat minimum you usually have an equivalent sharp minimum, and vice versa.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

But don't draw general conclusions from this plot. It's a 2D projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the topology of the network and the activation function you choose. Those are the relevant things to focus on.
I remembered that this often expressed opinion was also repeated on this forum... short story: it's false. For every flat minimum you usually have an equivalent sharp minimum, and vice versa.
Interesting. It's clear that you can reparametrize. My intuition would still be that SGD is more likely to converge to a flat solution than to the equivalent sharp solutions, i.e. if you train a NN N times with SGD, in what type of environment would it land?
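
The reparametrisation point can be checked on a toy model. The sketch below is an illustration under assumptions (a two-parameter ReLU model $w_2\,\mathrm{relu}(w_1 x)$, synthetic data, and "sharpness" measured as the largest Hessian eigenvalue obtained by finite differences); it is not taken from the cited paper. Scaling $w_1$ by $\alpha > 0$ and $w_2$ by $1/\alpha$ leaves the function and the loss unchanged, while the curvature at the minimum changes a lot.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(params, x, y):
    """Mean squared error of the toy model w2 * relu(w1 * x)."""
    w1, w2 = params
    return np.mean((w2 * relu(w1 * x) - y) ** 2)

def hessian(f, p, h=1e-4):
    """Central finite-difference Hessian of a scalar function f at point p."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(p + ei + ej) - f(p + ei - ej)
                       - f(p - ei + ej) + f(p - ei - ej)) / (4.0 * h * h)
    return H

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = relu(x)                                  # data fitted exactly when w1 * w2 == 1

alpha = 10.0                                 # arbitrary rescaling factor
theta_a = np.array([1.0, 1.0])
theta_b = np.array([alpha, 1.0 / alpha])     # same function, rescaled weights

same_output = np.allclose(theta_a[1] * relu(theta_a[0] * x),
                          theta_b[1] * relu(theta_b[0] * x))
loss_a, loss_b = loss(theta_a, x, y), loss(theta_b, x, y)

f = lambda p: loss(p, x, y)
sharp_a = np.linalg.eigvalsh(hessian(f, theta_a)).max()
sharp_b = np.linalg.eigvalsh(hessian(f, theta_b)).max()
```

Both points are global minima of the same function, yet `sharp_b` is much larger than `sharp_a`. This says nothing about which of the two SGD is more likely to reach, which is the question raised above.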

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: exp(5) = $e^5$

A few thoughts on sharpness-flatness:

1) In a deeply layered network, a subtle and relatively soft nonlinearity in each layer might amplify up the layers to create an extremely sharp function -- approaching a step function -- between a deep input and a distant output. Yet the exact location of the step might be a relatively mild function of any of the parameters in those layers (although the parameter values would be strongly correlated across the layers). Thus sharpness can be emergent across an extended area of the net rather than always localized and optimize to a few nodes.

2) A sufficiently large NN has so many DOF, that it can express many different solution architectures that have equivalent output performance despite having very different internal structures to the parameters and flatness/sharpness properties. Perhaps one reason for the robustness of performance is that these systems have so many DOF that if one part of the net gets locked into a suboptimal local minimum, some other part of the net finds a better minimum or compensating function.

4) Some aspects of flatness<->sharpness are intrinsic to the system. If you are attempting to predict the phase of water as a function of temperature under equilibrium conditions, the transitions between solid, liquid, and gas are perfectly sharp -- 100.1°C water will ALWAYS be a gas. But if you are attempting to predict the phase of water as a function of temperature under dynamic, non-equilibrium conditions, the transitions between solid, liquid, and gas are mushy -- 100.1°C water might be liquid in a large fraction of the training set and in real life.

Cuchulainn
Posts: 62936
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

But don't draw general conclusions from this plot. It's a 2D projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the topology of the network and the activation function you choose. Those are the relevant things to focus on.
I remembered that this often expressed opinion was also repeated on this forum... short story: it's false. For every flat minimum you usually have an equivalent sharp minimum, and vice versa.
Fuzzy-wuzzy IMO. What I miss in general are sharp definitions (no pun intended) and references to the maths on which the article is based. It is very difficult to check whether what is being advocated is true.

Can you not 'get rid' of gradients?
Exercise: how does the method work on optimising the toy example $(\log(x) - 5)^2$?
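
A minimal sketch of the exercise, taking "the method" to mean plain fixed-step gradient descent (an assumption; the step sizes, starting points and iteration counts are arbitrary choices). The minimiser is $x = e^5$. Descending directly in $x$ is slow because the curvature at the minimum is $2e^{-10}$, whereas the substitution $u = \log(x)$ turns the problem into the well-conditioned quadratic $(u - 5)^2$.

```python
import math

TARGET = math.exp(5.0)   # minimiser of (log(x) - 5)^2

def gd_in_x(x0=1.0, lr=1.0, steps=1000):
    """Fixed-step gradient descent directly in x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * (math.log(x) - 5.0) / x   # d/dx (log(x) - 5)^2
    return x

def gd_in_u(u0=0.0, lr=0.25, steps=50):
    """Gradient descent in u = log(x), where the objective is (u - 5)^2."""
    u = u0
    for _ in range(steps):
        u -= lr * 2.0 * (u - 5.0)                 # d/du (u - 5)^2
    return math.exp(u)                            # map back to x

x_direct = gd_in_x()     # still well short of e^5 after 1000 steps
x_reparam = gd_in_u()    # essentially e^5 after 50 steps
```

The same minimum looks flat in $x$ and well-conditioned in $u$, which is the reparametrisation issue in one dimension.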

//
Our work focuses on this particular conjecture, arguing that there are critical issues when applying the concept of flat minima to deep neural networks, which require rethinking what flatness actually means.

Very disconcerting.

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: exp(5) = $e^5$

But don't draw general conclusions from this plot. It's a 2D projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the topology of the network and the activation function you choose. Those are the relevant things to focus on.
I remembered that this often expressed opinion was also repeated on this forum... short story: it's false. For every flat minimum you usually have an equivalent sharp minimum, and vice versa.
Fuzzy-wuzzy IMO. What I miss in general are sharp definitions (no pun intended) and references to the maths on which the article is based. It is very difficult to check whether what is being advocated is true.

Can you not 'get rid' of gradients?
Exercise: how does the method work on optimising the toy example $(\log(x) - 5)^2$?

//
Our work focuses on this particular conjecture, arguing that there are critical issues when applying the concept of flat minima to deep neural networks, which require rethinking what flatness actually means.

Very disconcerting.
What proof do you have that feasibility or performance on "toy examples" has any predictive value for feasibility or performance on large examples? There are many oft-used methods in statistics that only work on large sample sizes.

Cuchulainn
Posts: 62936
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

OK, give a big example if you like. And go to step 1 again. There are several concerns pending.

ISayMoo
Posts: 2366
Joined: September 30th, 2015, 8:30 pm

### Re: exp(5) = $e^5$

But don't draw general conclusions from this plot. It's a 2D projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the topology of the network and the activation function you choose. Those are the relevant things to focus on.
I remembered that this often expressed opinion was also repeated on this forum... short story: it's false. For every flat minimum you usually have an equivalent sharp minimum, and vice versa.
Fuzzy-wuzzy IMO. What I miss in general are sharp definitions (no pun intended) and references to the maths on which the article is based. It is very difficult to check whether what is being advocated is true.

Can you not 'get rid' of gradients?
Exercise: how does the method work on optimising the toy example $(\log(x) - 5)^2$?

//
Our work focuses on this particular conjecture, arguing that there are critical issues when applying the concept of flat minima to deep neural networks, which require rethinking what flatness actually means.

Very disconcerting.
What "method"? The paper doesn't advocate a new method of optimisation.

ISayMoo
Posts: 2366
Joined: September 30th, 2015, 8:30 pm

### Re: exp(5) = $e^5$

A few thoughts on sharpness-flatness:

1) In a deeply layered network, a subtle and relatively soft nonlinearity in each layer might amplify up the layers to create an extremely sharp function -- approaching a step function -- between a deep input and a distant output.  Yet the exact location of the step might be a relatively mild function of any of the parameters in those layers (although the parameter values would be strongly correlated across the layers).  Thus sharpness can be emergent across an extended area of the net rather than always localized and optimize to a few nodes.
You're talking about something else than the paper I cited.
2) A sufficiently large NN has so many DOF, that it can express many different solution architectures that have equivalent output performance despite having very different internal structures to the parameters and flatness/sharpness properties. Perhaps one reason for the robustness of performance is that these systems have so many DOF that if one part of the net gets locked into a suboptimal local minimum, some other part of the net finds a better minimum or compensating function.
That would explain successful training, not successful generalisation. A common mistake...
4) Some aspects of flatness<->sharpness are intrinsic to the system.  If you are attempting to predict the phase of water as a function of temperature under equilibrium conditions, the transitions between solid, liquid, and gas are perfectly sharp -- 100.1°C water will ALWAYS be a gas.  But if you are attempting to predict the phase of water as a function of temperature under dynamic, non-equilibrium conditions, the transitions between solid, liquid, and gas are mushy -- 100.1°C water might be liquid in a large fraction of the training set and in real life.
Again, you're talking about a different "sharpness" than the article I cited.

Dudes, can you please read the paper before commenting? Surprisingly, this time it was outrun who was the closest to getting the point.

ISayMoo
Posts: 2366
Joined: September 30th, 2015, 8:30 pm

### Re: exp(5) = $e^5$

Interesting. It's clear that you can reparametrize. My intuition would still be that SGD is more likely to converge to a flat solution than to the equivalent sharp solutions, i.e. if you train a NN N times with SGD, in what type of environment would it land?
Indeed, they treat all equivalent parameterisations on equal footing, regardless of how "reachable" they are via SGD.

Cuchulainn
Posts: 62936
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

You are so cute when you are angry? No dudes here, mate. Anyway, how do you know how many times we read it? I hope that's not the answer you give in a live seminar.

EXAMPLE: I READ THE DL-PAPER + RELATED AT LEAST 4 TIMES + SENT QUERIES TO THE AUTHORS TWICE. NO RESPONSE.

Once bitten, twice shy. There's some alchemy in some papers. Maybe you know the answers to 1..5 below. I and my MSc students have worked on each of these topics.

//

Hello,
I have a number of questions and remarks on the article. In general, I feel the numerical methods are difficult to use or just wrong:

1. Picard/Banach does not converge exponentially as you claim?
2. Finding a contraction is difficult (sometimes impossible). To be honest, I am sceptical.
3. Multilevel Monte Carlo is slow from my experience. Is the underlying maths also robust?
4. The PDEs used are esoteric and not mainstream. Why not start with 3d heat equation, Black Scholes, Heston?
5. I don't see sharp error estimates.

Maybe I am missing something.
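
As a reference point for items 1 and 2, here is a minimal sketch of what the Banach fixed-point theorem guarantees for a contraction: the error shrinks by roughly a constant factor per iteration, i.e. geometric (linear) convergence. The map $g(x) = \cos(x)$ is an assumed stand-in and not one of the operators in the paper being queried.

```python
import math

def fixed_point(g, x0, steps):
    """Iterate x_{n+1} = g(x_n) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(g(xs[-1]))
    return xs

xs = fixed_point(math.cos, 1.0, 200)
x_star = xs[-1]                                   # converged to machine precision

errors = [abs(x - x_star) for x in xs[:30]]
ratios = [errors[k + 1] / errors[k] for k in range(10, 20)]

# Local contraction constant |g'(x*)| = |sin(x*)|; the ratios approach it.
q = abs(math.sin(x_star))
```

The error ratios settle near `q`, so the error after $n$ steps behaves like $q^n$. Whether that counts as "exponential" convergence depends on the convention used, and the rate degrades as `q` approaches 1.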

//
FYI I have both an MSc and a PhD in numerical analysis, both original research.
BTW are these articles you post refereed?