
 
User avatar
Cuchulainn
Posts: 59409
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

Re: If you are bored with Deep Networks

April 2nd, 2019, 12:46 pm

Anonymous Quote

I know GD, but I don't have working experience with it. I have some experience with the EM algorithm applied to various Markov models. There are some points you mentioned that I can relate to, as below:
  • Initial guess close to the real solution (Analyse Numerique 101).
  • No guarantee that GD is applicable in the first place (it assumes the cost function is smooth).
  • Convergence to a local minimum.
  • The method is iterative, so no truly reliable quality of service (QoS).
  • It's not very robust.
GD converges on non-differentiable functions as long as they're convex. E.g. it will handle f(x) = |x| as well as f(x) = x^2.
This (benign) example is a bit like the economist on the train to Scotland who looks out of the window, sees a black sheep and concludes that all sheep are black. Not so: it was a single sheep with one black side.
GD for discontinuous functions is ill-defined, e.g. the Heaviside function. In fact, I would be surprised if your example works; it could be serendipitous (getting lucky).
Nocedal/Wright briefly discuss non-smooth problems and subgradients.
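A minimal numerical sketch of the two extremes (step size and starting point are arbitrary choices of mine): GD behaves as expected on [$]x^2[$], while on the Heaviside function the gradient is zero wherever it exists, so the iteration never moves and any apparent "convergence" is luck.

Code:

# Sketch: fixed-step GD on f(x) = x^2 versus the Heaviside step function.
# For the step function the gradient is 0 everywhere except at the jump,
# so the iteration never moves.

def gd(grad, x0, lr=0.1, n=100):
    x = x0
    for _ in range(n):
        x -= lr * grad(x)
    return x

print(gd(lambda x: 2.0 * x, 3.0))   # f(x) = x^2  -> essentially 0, converges
print(gd(lambda x: 0.0, 3.0))       # Heaviside   -> stays at 3.0, never moves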

I am assuming you implicitly agree with my other bullet points.

Have we forgotten exploding gradients??

https://en.wikipedia.org/wiki/Vanishing ... nt_problem
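On the exploding side, the standard engineering mitigation is to clip the gradient norm before the update; a minimal sketch (threshold and names are my own, purely illustrative):

Code:

import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale gradient g if its Euclidean norm exceeds max_norm."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = np.array([300.0, -400.0])   # an 'exploded' gradient, norm = 500
print(clip_by_norm(g))          # -> [ 0.6 -0.8], norm = 1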

For non-differentiable functions, gradient methods are ill-defined. For locally Lipschitz problems and especially for convex minimization problems, bundle methods of descent are well-defined. Non-descent methods, like subgradient projection methods, may also be used. These methods are typically slower than gradient descent.
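A sketch of the plain subgradient method on [$]f(x) = |x|[$] with the standard diminishing step [$]\alpha_k = c/\sqrt{k}[$] (constants are my own choices); the best objective value decays only sublinearly, which is the sense in which these methods are slower than GD on a smooth problem:

Code:

import math

def subgradient_method(x0, c=1.0, n=1000):
    """Subgradient method for f(x) = |x| with step alpha_k = c / sqrt(k)."""
    x, best = x0, abs(x0)
    for k in range(1, n + 1):
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)   # a subgradient of |x|
        x -= (c / math.sqrt(k)) * g
        best = min(best, abs(x))    # track the best objective value so far
    return best

print(subgradient_method(5.0))      # best value shrinks only roughly like O(1/sqrt(n))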
Last edited by Cuchulainn on April 2nd, 2019, 1:17 pm, edited 2 times in total.
 
User avatar
Cuchulainn
Posts: 59409
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

Re: If you are bored with Deep Networks

April 2nd, 2019, 12:59 pm

GD converges on non-differentiable functions as long as they're convex. 

Non-differentiable optimisation ==> in general non-convex, sadly

https://optimization.mccormick.northwes ... timization

 f(x) = |x| 

[Image: plot of f(x) = |x|]
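The kink at 0 is exactly where a fixed step size gets into trouble: once the iterate is within one step of the minimiser, each update just flips its sign. A quick sketch (step size is an arbitrary choice of mine):

Code:

# Fixed-step (sub)gradient iteration on f(x) = |x|: the iterate ends up
# bouncing around the kink instead of converging to it.
alpha, x = 0.1, 0.25
for k in range(8):
    g = 1.0 if x > 0 else -1.0      # a subgradient of |x| away from 0
    x -= alpha * g
    print(k, round(x, 3))
# 0.15, 0.05, -0.05, 0.05, -0.05, ... : oscillation, not convergence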
Last edited by Cuchulainn on April 2nd, 2019, 1:34 pm, edited 3 times in total.
 
User avatar
Cuchulainn
Posts: 59409
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

Re: If you are bored with Deep Networks

April 2nd, 2019, 1:08 pm

It's a sad day if the conclusion is that NNs are essentially only suitable for differentiable convex functions and unconstrained optimisation. Time for the mathematics to catch up, what?

//
The assumptions underlying the Wolfe and Armijo conditions don't hold any more, arrivederci!!

On the other hand, software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is not defined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the non-differentiability of the hidden unit activation functions.
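A minimal sketch of the convention described in that passage, with ReLU as the activation (returning 0 at exactly x = 0 is a common choice, but it is one of several valid conventions, not a mathematical necessity):

Code:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # At x == 0 the derivative does not exist; return the one-sided value 0
    # rather than raising an error, as the passage above describes.
    return (x > 0).astype(float)

x = np.array([-1.0, 0.0, 2.0])
print(relu(x), relu_grad(x))   # [0. 0. 2.] [0. 0. 1.]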
 
User avatar
ISayMoo
Topic Author
Posts: 1894
Joined: September 30th, 2015, 8:30 pm

Re: If you are bored with Deep Networks

April 2nd, 2019, 1:54 pm

GD converges on non-differentiable functions as long as they're convex. E.g. it will handle f(x) = |x| as well as f(x) = x^2.
This (benign) example is a bit like the economist on the train to Scotland who looks out of the window, sees a black sheep and concludes that all sheep are black. Not so: it was a single sheep with one black side.
GD for discontinuous functions is ill-defined, e.g. the Heaviside function. In fact, I would be surprised if your example works; it could be serendipitous (getting lucky).
The Heaviside function is not convex. Convergence of GD for continuous, convex but non-differentiable functions can be proven; see Chapter 14 of this book.
I am assuming you implicitly agree with my other bullet points.
Not yet ;-)
 
User avatar
Cuchulainn
Posts: 59409
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

Re: If you are bored with Deep Networks

April 4th, 2019, 6:31 pm

Chapter 14 is a step in the right direction but it falls short as far as I can see: 1) it is a strange mix of (discrete) linear algebra and probability theory, 2) a lot hangs on the Lipschitz constant [$]\rho[$], 3) finding the set of subgradients (the subdifferential set) is probably not easy.
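On point 3, for the canonical example at least, the subdifferential is explicit (a worked case, purely for concreteness): [$]\partial |x| = \{+1\}[$] for [$]x > 0[$], [$]\{-1\}[$] for [$]x < 0[$], and [$][-1, 1][$] at [$]x = 0[$]. Any [$]g \in \partial f(x)[$] satisfies [$]f(y) \ge f(x) + g(y - x)[$] for all [$]y[$], which is the defining inequality such convergence analyses lean on.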

I don't see how this algorithm can be automated well.

Discontinuous gradients in ODEs have been studied for at least 30 years. In particular, ODE solvers can find points of discontinuity (e.g. by using boundary-layer techniques) and act accordingly. And there is hard maths underlying the material, e.g.

https://www.researchgate.net/publicatio ... _to_LS-SVM

A general feeling is that ML mathematical research is missing out. It feels like Linear Algebra++, without real analysis.
 
User avatar
ISayMoo
Topic Author
Posts: 1894
Joined: September 30th, 2015, 8:30 pm

Re: If you are bored with Deep Networks

April 6th, 2019, 3:46 pm

Of course it's probability theory - ML is a probabilistic discipline (even if it doesn't always show). Why should we avoid probabilities?

You don't need the full set of subgradients, just A subgradient. Finding one is usually easy.
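For a typical ML loss it is indeed mechanical; a sketch for the hinge loss [$]\max(0, 1 - y\,w^\top x)[$] (names are mine, purely illustrative), where picking the zero branch at the kink already gives a valid subgradient:

Code:

import numpy as np

def hinge_subgrad(w, x, y):
    """One valid subgradient of max(0, 1 - y * w.x) with respect to w."""
    margin = 1.0 - y * np.dot(w, x)
    return -y * x if margin > 0 else np.zeros_like(w)   # pick 0 at the kink

w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
print(hinge_subgrad(w, x, y=1.0))   # margin = 0.9 > 0  ->  [-1. -2.]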
 
User avatar
katastrofa
Posts: 7947
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

Re: If you are bored with Deep Networks

April 6th, 2019, 3:55 pm

I thought ML was a statistical (~= probabilistic) discipline.
 
User avatar
ISayMoo
Topic Author
Posts: 1894
Joined: September 30th, 2015, 8:30 pm

Re: If you are bored with Deep Networks

April 6th, 2019, 3:59 pm

I was afraid to use the word "statistical" :)

But I agree with Cuch that Chapter 14 in the book I cited is not enough. Firstly, the convexity requirement excludes deep networks and other interesting applications. Secondly, they use Polyak averaging (returning the average of all iterates), which is not what people actually do in practice.
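For concreteness, a minimal sketch of the difference (toy problem, step size and iteration count are my own choices): the Chapter 14 guarantee is stated for the averaged iterate [$]\bar{w} = \frac{1}{T}\sum_{t=1}^T w^{(t)}[$], whereas practice usually keeps only the last iterate.

Code:

import numpy as np

rng = np.random.default_rng(0)

def sgd_with_averaging(grad, w0, lr=0.05, T=500):
    """Return both the last iterate and the Polyak average of all iterates."""
    w, running_sum = w0, np.zeros_like(w0)
    for t in range(T):
        w = w - lr * grad(w)
        running_sum += w
    return w, running_sum / T      # last iterate, averaged iterate

# Toy convex objective: E[(w - z)^2] with noisy gradients; minimiser is w = 0.
noisy_grad = lambda w: 2.0 * (w - rng.normal(0.0, 1.0))
last, avg = sgd_with_averaging(noisy_grad, np.array([5.0]))
print(last, avg)   # the average is usually noticeably closer to 0 than the last iterate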
 
User avatar
katastrofa
Posts: 7947
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

Re: If you are bored with Deep Networks

April 6th, 2019, 4:17 pm

I don't understand why convexity excludes deep networks, but I think Polyak averaging can't be used in non-convex problems.
 
User avatar
ISayMoo
Topic Author
Posts: 1894
Joined: September 30th, 2015, 8:30 pm

Re: If you are bored with Deep Networks

April 6th, 2019, 7:03 pm

The output of deep networks is not a convex function of the weights.
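A quick numerical way to see it (tiny network and numbers are mine, purely for illustration): swapping the two hidden units of a small tanh network leaves the loss unchanged, yet the loss at the midpoint of the two weight configurations is larger, which a convex function cannot do.

Code:

import numpy as np

def loss(w1, w2, x, y):
    """Squared error of a tiny 2-hidden-unit tanh network: (w2 . tanh(w1 x) - y)^2."""
    return (w2 @ np.tanh(w1 @ x) - y) ** 2

x, y = np.array([1.0]), 1.0
w1, w2   = np.array([[1.0], [-1.0]]), np.array([1.0, -1.0])
w1p, w2p = w1[::-1].copy(), w2[::-1].copy()          # swap the two hidden units

f_a = loss(w1, w2, x, y)                             # ~0.274
f_b = loss(w1p, w2p, x, y)                           # identical, by the permutation symmetry
f_m = loss(0.5 * (w1 + w1p), 0.5 * (w2 + w2p), x, y) # midpoint collapses to the zero net -> 1.0
print(f_a, f_b, f_m)   # f_m > (f_a + f_b) / 2, so the loss is not convex in the weights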
 
User avatar
ISayMoo
Topic Author
Posts: 1894
Joined: September 30th, 2015, 8:30 pm

Re: If you are bored with Deep Networks

April 6th, 2019, 8:34 pm

Re the 2nd part of your post, you've made a very good point that Polyak averaging exploits the convexity of the minimised function, but AFAIK people sometimes treat these two things (convexity and using Polyak averaging) separately. See e.g. https://hal.archives-ouvertes.fr/hal-01623986
 
User avatar
Cuchulainn
Posts: 59409
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

Re: If you are bored with Deep Networks

April 14th, 2019, 10:41 am

Examining Chapter 14 (and some of the others), they do have a very pure maths/linear algebra feel about them (BTW nothing wrong with that, but everything in moderation). You are closer to the coal face, but there are features of this approach that seem quite restrictive, and it is 'one step too far' in the chain: it skips the continuous formulation and jumps directly to the Euler method, essentially.

Some unanswered (or not posed) questions about (S)GD:

1. Global convergence, i.e. for any [$]x^{(0)}[$], is not supported.
2. GD is a "greedy" algorithm and will converge to a local minimum. And then you are stuck there.
3. It seems to be focused on differentiable convex functions. This is very restrictive.
4. It is not clear how to support constraints and regularisation parameters (beyond experimentation ad nauseam).
5. The gradient [$]g(x) = \nabla f(x)[$] is crucial, but how do we compute it? AD, FDM, analytic, the Complex Step Method? I've even seen a recent finance paper on NNs where the authors use cubic splines to compute gradients (a comparison sketch of two of these options follows below).
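On point 5, a sketch comparing two of those options on a toy function (function and step size chosen by me): the complex-step estimate [$]f'(x) \approx \mathrm{Im}\, f(x + ih)/h[$] involves no subtraction, so it remains accurate for tiny [$]h[$] where central differences lose digits to cancellation.

Code:

import numpy as np

f = lambda x: np.exp(x) * np.sin(x)                      # toy objective
fprime = lambda x: np.exp(x) * (np.sin(x) + np.cos(x))   # analytic derivative, for reference

x, h = 1.0, 1e-10
central = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference: cancellation error
cstep   = np.imag(f(x + 1j * h)) / h        # complex step: no subtraction involved

exact = fprime(x)
print(abs(central - exact), abs(cstep - exact))   # complex step is accurate to machine precision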

Taking a step backwards, the gradient-system ODE and the Langevin SDE are generalisations of GD. At your suggestion I applied this by solving the least-squares linear regression problem as ODE/SGD.

viewtopic.php?f=34&t=101662

You really want 'automagic' energy-diminishing methods (and not tweaking hyperparameters).
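For reference, the simplest form of that connection (toy data, step size and noise scale are my own choices): explicit Euler applied to the gradient-flow ODE [$]dw/dt = -\nabla f(w)[$] is exactly GD, and adding a Brownian increment gives an Euler-Maruyama step for the Langevin SDE.

Code:

import numpy as np

rng = np.random.default_rng(0)

# Least-squares linear regression: f(w) = 0.5 * ||X w - y||^2, grad f(w) = X^T (X w - y)
X = rng.normal(size=(50, 2))
w_true = np.array([1.5, -0.7])
y = X @ w_true + 0.01 * rng.normal(size=50)
grad = lambda w: X.T @ (X @ w - y)

w_ode = np.zeros(2)                  # explicit Euler on dw/dt = -grad f(w)  (= GD)
w_sde = np.zeros(2)                  # Euler-Maruyama on dw = -grad f(w) dt + sigma dW
dt, sigma = 0.01, 0.05
for _ in range(2000):
    w_ode = w_ode - dt * grad(w_ode)
    w_sde = w_sde - dt * grad(w_sde) + sigma * np.sqrt(dt) * rng.normal(size=2)

print(w_true, w_ode, w_sde)          # both end up near the least-squares solution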
//

No two languages are ever sufficiently similar to be considered as representing the same social reality. The worlds in which different societies live are distinct worlds, not merely the same world with different labels attached. Edward Sapir
 
User avatar
katastrofa
Posts: 7947
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

Re: If you are bored with Deep Networks

April 18th, 2019, 10:10 am

 
User avatar
Cuchulainn
Posts: 59409
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

Re: If you are bored with Deep Networks

April 18th, 2019, 11:45 am

Interesting.
I will not buy this record, it is scratched.
 
User avatar
ISayMoo
Topic Author
Posts: 1894
Joined: September 30th, 2015, 8:30 pm

Re: If you are bored with Deep Networks

April 30th, 2019, 1:17 pm

DL as craft (1)
DL as craft (2)

Yes, it doesn't sound too scientific. "Become one with the data".