SERVING THE QUANTITATIVE FINANCE COMMUNITY

Cuchulainn
Posts: 57641
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: If you are bored with Deep Networks

As I said, all this stuff is Euler method (duh).

Here is the first step to solve an ODE (is AD the way to do it here??)

https://arxiv.org/pdf/1806.07366.pdf
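The Euler connection is one line: a residual block $x_{k+1} = x_k + f(x_k)$ is exactly a forward Euler step for $dx/dt = f(x)$ with unit step size. A sketch (hypothetical $f$, not code from the paper):

```python
def euler(f, x0, h, n):
    """Forward Euler: x_{k+1} = x_k + h * f(x_k) -- the 'residual block' update."""
    x = x0
    for _ in range(n):
        x = x + h * f(x)
    return x

# dx/dt = -x, x(0) = 1: exact solution is exp(-t), so x(1) ~ 0.3679
approx = euler(lambda x: -x, x0=1.0, h=0.001, n=1000)
print(approx)
```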

Cuchulainn
Posts: 57641
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: If you are bored with Deep Networks

It's a mystery.

> Just to be clear: I was joking. The point is that people choose a heuristic learning rate and some value ranges "usually work". But everyone knows that this is not how things should be done, and the theory of optimisation in statistical learning is an active research field.
> Well, I thought it was a joke. And quite a good one. Isn’t much of numerical analysis and optimization like that? Rather arbitrary. One of the many reasons I don’t like all these subjects.
It only looks that way. You can prove convergence of FEM schemes in weighted Sobolev spaces, but 1) it takes longer and 2) someone has to pay.

Numerical methods for AI are somewhat outdated, to a greater or lesser extent.

Cuchulainn
Posts: 57641
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: If you are bored with Deep Networks

Part of the documentation of the routine LBFGS by J. Nocedal (one of the giants in the field of nonlinear numerical optimization):

```fortran
C     GTOL is a DOUBLE PRECISION variable with default value 0.9, which
C        controls the accuracy of the line search routine MCSRCH. If the
C        function and gradient evaluations are inexpensive with respect
C        to the cost of the iteration (which is sometimes the case when
C        solving very large problems) it may be advantageous to set GTOL
C        to a small value. A typical small value is 0.1.  Restriction:
C        GTOL should be greater than 1.D-04.
```
> Why 0.9? Why 0.1? Why greater than 1e-4? You can shoot off the same questions at LBFGS as the ones Cuch throws at SGD. SGD doesn't have an automated way of setting the learning rate, so it's "dumb". Methods like LBFGS contain an algorithm to automatically set the learning rate, but these algorithms in 99 cases out of 100 contain some hyper-parameters which you either adjust to every problem or set to some "typical" value. If you're lucky, there's a theorem which tells you the bounds within which you have to stay. But because this is hidden somewhere in the bowels of an ancient Fortran library, people naively think it "just works". Just like SGD, it works until it doesn't. There's no magical way around the problem that if you're optimising a function based on point estimates, you have a learning problem and the no-free-lunch theorem comes down on you like a ton of bricks.
If you study the book by Nocedal and Wright (BTW I have) you will see that there are more nuanced approaches as well.
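That "works until it doesn't" behaviour is easy to reproduce; a toy sketch (my own illustration, plain gradient descent rather than SGD or LBFGS) where the only thing that changes is the hand-picked learning rate:

```python
# Plain gradient descent on f(x) = x^2 with a fixed, hand-picked
# learning rate. Whether it converges or blows up depends entirely
# on that single hyperparameter.

def gradient_descent(lr, x0=1.0, steps=50):
    """Minimise f(x) = x^2 with a fixed learning rate lr."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x       # f'(x) = 2x
        x -= lr * grad
    return x

print(abs(gradient_descent(lr=0.1)))   # shrinks towards 0: "it works"
print(abs(gradient_descent(lr=1.1)))   # grows without bound: "until it doesn't"
```

With lr = 0.1 each step multiplies $x$ by 0.8; with lr = 1.1 it multiplies $x$ by -1.2, so the very same code diverges.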

ISayMoo
Topic Author
Posts: 1231
Joined: September 30th, 2015, 8:30 pm

### Re: If you are bored with Deep Networks

> If you study the book by Nocedal and Wright (BTW I have) you will see that there are more nuanced approaches as well.
I know this book, I implemented some algorithms from it. Can you tell me what general purpose non-linear optimisation methods described in it are fully automatic and have no user-adjustable parameters?

Cuchulainn
Posts: 57641
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: If you are bored with Deep Networks

> If you study the book by Nocedal and Wright (BTW I have) you will see that there are more nuanced approaches as well.
> I know this book, I implemented some algorithms from it. Can you tell me what general purpose non-linear optimisation methods described in it are fully automatic and have no user-adjustable parameters?
I would say the backtracking discrete methods (e.g. page 37 etc.). The continuous analogues (ODE solvers) have all this built in, so as a 'user' you just give a tolerance and the solver does the rest, instead of having to choose from a palette of learning rates. Backtracking is well-established in numerical analysis.

But if by 'user' you mean something else that's another discussion.
Last edited by Cuchulainn on December 17th, 2018, 3:18 pm, edited 1 time in total.

Cuchulainn
Posts: 57641
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: If you are bored with Deep Networks

“"Neural networks" are a sad misnomer. They're neither neural nor even networks. They're chains of differentiable, parameterized geometric functions, trained with gradient descent (with gradients obtained via the chain rule). A small set of highschool-level ideas put together.” — François Chollet

https://en.wikipedia.org/wiki/Keras

Cuchulainn
Posts: 57641
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

### Re: If you are bored with Deep Networks

> Just like SGD, it works until it doesn't.

Sounds defeatist.
In mathematics, conditions are stated under which an algorithm works for a given class of problems. It will not work when the problem does not satisfy the assumptions. For example, the fixed-point iteration

$x_{k+1} = f(x_k), \quad k = 0, 1, 2, \ldots$

is guaranteed to converge when $f$ is a contraction (Banach fixed-point theorem).
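A quick sketch of that condition in action (my own illustration): $\cos$ is a contraction near its fixed point, while $2x$ is not.

```python
import math

def fixed_point(f, x0, n=100):
    """Iterate x_{k+1} = f(x_k) for n steps."""
    x = x0
    for _ in range(n):
        x = f(x)
    return x

# cos is a contraction near its fixed point (|sin x| < 1 there): the
# iteration settles at the Dottie number x* ~ 0.739085.
x_star = fixed_point(math.cos, 1.0)
print(x_star)

# 2x is not a contraction (|f'| = 2 > 1): the iterate runs away.
print(fixed_point(lambda x: 2.0 * x, 1.0, n=30))
```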

I see it as: for what class of problems does SGD work? I don't think this has been done, at least not in the spate of articles to date. For example, SGD only finds local minima. And then SGD for constrained optimisation is a Pandora's box, yes? It gets very fuzzy-fuzzy?

ODE solvers take into account the problem they are trying to solve and adjust parameters accordingly.
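For instance (a hypothetical sketch, not any particular library's code): estimate the local error by comparing one full Euler step against two half steps, then shrink or grow the step to keep the error near the user's tolerance. The only user input is `tol`.

```python
def adaptive_euler(f, t0, y0, t_end, tol=1e-6, h=0.1):
    """Euler with automatic step-size control driven by a user tolerance."""
    t, y = t0, y0
    while t < t_end:
        h = min(h, t_end - t)
        y_full = y + h * f(t, y)                 # one full step
        y_half = y + (h / 2) * f(t, y)           # two half steps
        y_half = y_half + (h / 2) * f(t + h / 2, y_half)
        err = abs(y_full - y_half)               # local error estimate
        if err <= tol:
            t, y = t + h, y_half                 # accept the step
            if err < tol / 4:
                h *= 2.0                         # error small: grow h
        else:
            h *= 0.5                             # reject: shrink h
    return y

# dy/dt = -y, y(0) = 1: exact y(1) = 1/e ~ 0.3679
print(adaptive_euler(lambda t, y: -y, 0.0, 1.0, 1.0, tol=1e-6))
```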

ISayMoo
Topic Author
Posts: 1231
Joined: September 30th, 2015, 8:30 pm

### Re: If you are bored with Deep Networks

> If you study the book by Nocedal and Wright (BTW I have) you will see that there are more nuanced approaches as well.
> I know this book, I implemented some algorithms from it. Can you tell me what general purpose non-linear optimisation methods described in it are fully automatic and have no user-adjustable parameters?
> I would say the backtracking discrete methods (e.g. page 37 etc.). The continuous analogues (ODE solvers) have all this built in, so as a 'user' you just give a tolerance and the solver does the rest, instead of having to choose from a palette of learning rates. Backtracking is well-established in numerical analysis.
Hmm. In my edition (2nd, Springer, 2006), page 37 has Algorithm 3.1, which has two hyperparameters ($c$ and $\rho$). What is the theory for choosing the best values of $c$ and $\rho$?
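For what it's worth, Algorithm 3.1 is short enough to sketch in a few lines (my own rendering, not the book's code); `rho` and `c` are exactly the two knobs in question, set below to commonly suggested values rather than derived ones:

```python
def backtracking(f, grad_f, x, p, alpha0=1.0, rho=0.5, c=1e-4):
    """Armijo backtracking: shrink alpha by rho until sufficient decrease holds."""
    alpha = alpha0
    fx, gx = f(x), grad_f(x)
    # Sufficient-decrease (Armijo) condition: f(x + a*p) <= f(x) + c*a*g*p
    while f(x + alpha * p) > fx + c * alpha * gx * p:
        alpha *= rho
    return alpha

# One descent step on f(x) = x^4 from x = 1, search direction p = -f'(1):
f = lambda x: x ** 4
grad_f = lambda x: 4.0 * x ** 3
alpha = backtracking(f, grad_f, 1.0, -grad_f(1.0))
print(alpha)  # the accepted step length
```

Termination is guaranteed whenever `p` is a descent direction, but nothing in the algorithm itself says why `rho = 0.5` or `c = 1e-4` should be the right choices for a given problem.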