SERVING THE QUANTITATIVE FINANCE COMMUNITY

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

The engineer says: "My equations are a model of the universe."
The physicist says: "The universe is a model of my equations."

The mathematician says: "I don't care."

Physicists have bouts of anxiety.

Collector
Posts: 3626
Joined: August 21st, 2001, 12:37 pm

### Re: exp(5) = $e^5$

Cuchulainn: "Physicists have bouts of anxiety."

It is known as Heisenberg’s uncertainty principle. I am sure it has an upper limit, so you can stop being so anxious that your electrons could be outside the visible universe at this moment in time.

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: exp(5) = $e^5$

Collector wrote:
Cuchulainn: "Physicists have bouts of anxiety."

It is known as Heisenberg’s uncertainty principle. I am sure it has an upper limit, so you can stop being so anxious that your electrons could be outside the visible universe at this moment in time.
It's worse than Heisenberg! When it comes to getting tenure, a physicist knows neither their position nor their momentum!

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

$\pi^5$ is the area of a circle of radius $\pi^2$. Probably a useless result.

More generally, $f(x) = \pi^x$ satisfies the ODE

$df/dx = a f, \quad f(0) = 1$ where $a = \log(\pi)$.

I would solve it in multiprecision with NDSolve; that may be better than raw multiplication.
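As a rough illustration of the ODE route, here is a minimal sketch that integrates $df/dx = a f$, $f(0) = 1$ with a classical fixed-step Runge-Kutta scheme and compares the result at $x = 5$ with direct exponentiation. It is not NDSolve and it is not multiprecision: it runs in ordinary double precision, and the function name `rk4_pi_power` and the step count are assumptions made only for this example.

```python
import math


def rk4_pi_power(x_end, steps=1000):
    """Integrate df/dx = a*f, f(0) = 1, with a = log(pi), up to x_end.

    Classical 4th-order Runge-Kutta with a fixed step (double precision).
    """
    a = math.log(math.pi)
    h = x_end / steps
    f = 1.0
    for _ in range(steps):
        k1 = a * f
        k2 = a * (f + 0.5 * h * k1)
        k3 = a * (f + 0.5 * h * k2)
        k4 = a * (f + h * k3)
        f += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return f


if __name__ == "__main__":
    approx = rk4_pi_power(5.0)
    exact = math.pi ** 5
    print("RK4 solution at x = 5:", approx)
    print("pi ** 5              :", exact)
    print("absolute difference  :", abs(approx - exact))
```

The right-hand side is linear, so each RK4 step multiplies $f$ by the same polynomial in $ah$; a multiprecision solver would change the arithmetic, not the structure.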

Is $\pi^e < e^\pi$, what?
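For what it's worth, the inequality does hold. A standard argument (not taken from this thread) looks at $g(x) = \log(x)/x$ for $x > 0$:

$$g(x) = \frac{\log x}{x}, \qquad g'(x) = \frac{1 - \log x}{x^2},$$

so $g$ increases for $x < e$, decreases for $x > e$, and has its unique maximum at $x = e$. Since $\pi \neq e$,

$$\frac{\log \pi}{\pi} < \frac{\log e}{e} = \frac{1}{e} \;\Longrightarrow\; e \log \pi < \pi \;\Longrightarrow\; \pi^e < e^\pi.$$

Numerically, $\pi^e \approx 22.459$ and $e^\pi \approx 23.141$.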

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

*TOL and [min, max] for Brent: 1e-05, [-3, 3]*

Error and number of iterations needed: 9.76004e-06, 127

(1, 16.1407) (2, 11.9878) (3, 9.50992) (4, 7.79936) (5, 6.52887) (6, 5.54231) (7, 4.75291) (8, 4.10753)
(9, 3.57131) (10, 3.12019) (11, 2.73685) (12, 2.40849) (13, 2.12536) (14, 1.87987) (15, 1.66604) (16, 1.47904)
(17, 1.31497) (18, 1.17059) (19, 1.04322) (20, 0.930608) (21, 0.830865) (22, 0.742369) (23, 0.663738) (24, 0.593783)
(25, 0.531476) (26, 0.475926) (27, 0.426356) (28, 0.382087) (29, 0.342526) (30, 0.307149) (31, 0.275497) (32, 0.247163)
(33, 0.221789) (34, 0.199056) (35, 0.178683) (36, 0.160418) (37, 0.14404) (38, 0.129348) (39, 0.116168) (40, 0.10434)
(41, 0.0937248) (42, 0.0841958) (43, 0.0756407) (44, 0.0679591) (45, 0.0610609) (46, 0.0548657) (47, 0.0493011) (48, 0.0443027)
(49, 0.0398125) (50, 0.0357786) (51, 0.0321542) (52, 0.0288978) (53, 0.0259718) (54, 0.0233425) (55, 0.0209798) (56, 0.0188566)
(57, 0.0169485) (58, 0.0152337) (59, 0.0136925) (60, 0.0123075) (61, 0.0110626) (62, 0.00994371) (63, 0.00893807) (64, 0.0080342)
(65, 0.00722178) (66, 0.00649154) (67, 0.00583518) (68, 0.0052452) (69, 0.0047149) (70, 0.00423823) (71, 0.00380976) (72, 0.00342461)
(73, 0.00307842) (74, 0.00276722) (75, 0.00248749) (76, 0.00223604) (77, 0.00201002) (78, 0.00180684) (79, 0.0016242) (80, 0.00146003)
(81, 0.00131245) (82, 0.00117979) (83, 0.00106054) (84, 0.000953348) (85, 0.000856988) (86, 0.000770368) (87, 0.000692504) (88, 0.00062251)
(89, 0.000559591) (90, 0.000503031) (91, 0.000452189) (92, 0.000406485) (93, 0.000365401) (94, 0.000328469) (95, 0.000295271) (96, 0.000265427)
(97, 0.0002386) (98, 0.000214485) (99, 0.000192807) (100, 0.00017332) (101, 0.000155802) (102, 0.000140055) (103, 0.0001259) (104, 0.000113175)
(105, 0.000101736) (106, 9.1454e-05) (107, 8.22108e-05) (108, 7.39017e-05) (109, 6.64325e-05) (110, 5.97182e-05) (111, 5.36825e-05) (112, 4.82568e-05)
(113, 4.33796e-05) (114, 3.89952e-05) (115, 3.5054e-05) (116, 3.15111e-05) (117, 2.83263e-05) (118, 2.54634e-05) (119, 2.28898e-05) (120, 2.05763e-05)
(121, 1.84967e-05) (122, 1.66273e-05) (123, 1.49468e-05) (124, 1.34361e-05) (125, 1.20781e-05) (126, 1.08574e-05) (127, 9.76004e-06)

Initial guess: [1](50.6)

Final result: [1](148.413)

Function value: 3.42116e-13

Everyone's talking about Steepest Descent (or Gradient Descent in AI). So why not minimise the following as a sanity check:

$(\log y - 5)^2$

Here is the output. Notice it is impervious to the pesky learning rate because I use Brent instead.
SD computes $\lambda$ using a 1d solver such as Brent, Golden Mean, Fibonacci etc., and a local minimum is good enough. With GD you guess a $\lambda$ and see how it goes.
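The contrast between the two ways of choosing $\lambda$ can be sketched as follows. This is not the code that produced the output in this post: it swaps Brent for a golden-section line search (to stay within the Python standard library), and the bracket `lam_max`, the fixed learning rate `lr`, and all function names are assumptions chosen for illustration. Only the objective $(\log y - 5)^2$ and the starting point 50.6 come from the post.

```python
import math

TARGET = 5.0  # minimise (log y - 5)^2, whose minimiser is y = e^5


def f(y):
    return (math.log(y) - TARGET) ** 2


def grad(y):
    return 2.0 * (math.log(y) - TARGET) / y


def golden_section(phi, lo, hi, tol=1e-10, max_iter=200):
    """Minimise a unimodal 1-D function phi on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        c = hi - r * (hi - lo)
        d = lo + r * (hi - lo)
        if phi(c) < phi(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)


def steepest_descent(y0, lam_max=1e4, tol=1e-10, max_iter=100):
    """Descent where lambda comes from a 1-D line search at every step."""
    y = y0
    for k in range(max_iter):
        g = grad(y)
        if abs(g) < tol:
            return y, k
        hi = lam_max
        if g > 0.0:
            hi = min(lam_max, 0.5 * y / g)  # keep y - lam * g positive
        lam = golden_section(lambda t: f(y - t * g), 0.0, hi)
        y -= lam * g
    return y, max_iter


def gradient_descent(y0, lr, tol=1e-10, max_iter=100_000):
    """Descent with a guessed, fixed lambda (the learning rate)."""
    y = y0
    for k in range(max_iter):
        g = grad(y)
        if abs(g) < tol:
            return y, k
        y -= lr * g
    return y, max_iter


if __name__ == "__main__":
    y_sd, n_sd = steepest_descent(50.6)
    y_gd, n_gd = gradient_descent(50.6, lr=1000.0)
    print("line search:", y_sd, "iterations:", n_sd)
    print("fixed lr   :", y_gd, "iterations:", n_gd)
    print("exp(5)     :", math.exp(5.0))
```

In one dimension an exact line search lands on the minimiser almost immediately once the bracket contains it, so the iteration counts here say little about higher-dimensional problems; the point is only that the fixed-rate variant needs a guessed `lr` while the line-search variant does not.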

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

no, again,

The subfield of Machine Learning called supervised learning in Neural Networks uses variants of stochastic gradient descent, which is *very* different; you should know that before talking about it.

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

outrun wrote:
no, again,

The subfield of Machine Learning called supervised learning in Neural Networks uses variants of stochastic gradient descent, which is *very* different; you should know that before talking about it.

Thank you for the history lesson. But wait a minute, all the media call it AI. So, I will stop. Mind you, Goodfellow in his book calls it "AI deep learning".

Correction: GD is used in ML/DL. Better?

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

This doesn't tell me much. In which sense not? BTW, you are wrong. Read this:

"Gradient descent is also known as steepest descent. However, gradient descent should not be confused with the method of steepest descent for approximating integrals."

From what I have read, they are variants of more general gradient methods; see e.g. Bertsekas.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

You want good use of terminology, no?

ML is a collection of statistical methods where a program solves problems by learning from data using generic learning methods, as opposed to postulating a model or using expert opinion and then coding it up. In the past, engineers spent lots of time postulating and implementing filters to manipulate data based on personal views and experience. Think e.g. about speech recognition or synthesis in the 80s (Hawking's voice).

Some popular methods in ML are random forests, support vector machines, clustering algorithms, kNN and NN. Some of these are called supervised learning, others unsupervised. None of these use derivatives of a loss function except the last.

DL is neural networks with more than a couple of layers. The most common supervised learning method with NN is backpropagation. It uses derivatives to learn, and there are two main elements:
1) Which samples do you use to compute an update? There are 3 variants of how to update your model parameters, based on different ways of selecting the sample data used for training. 1.1 Batch gradient descent uses all data samples. 1.2 Stochastic gradient descent uses one data point for each update. 1.3 Mini-batch gradient descent uses a subset of samples for each update. In NN people always use 1.3; in finance (e.g. calibration) people typically use 1.1.
2) Once gradients are computed from a set of samples and the computed loss, there are various methods for deciding how much to update the model parameters. Currently the most popular choice (though it's not as clear-cut as the previous choice) is ADAM, which is a momentum method. There are many others, each with their own hyperparameters. Some, e.g., have a learning rate, but others do not.
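The three sampling variants in 1) differ only in how many samples feed each update, which a small sketch can make concrete. The example below fits a straight line by least squares; the synthetic data, the names `make_data` and `fit`, the plain fixed learning rate (rather than ADAM), and all parameter values are assumptions for illustration only.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic samples from y = 2x + 1 plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


def fit(xs, ys, batch_size, lr=0.1, epochs=200, seed=1):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    batch_size == len(xs)     -> 1.1 batch gradient descent
    batch_size == 1           -> 1.2 stochastic gradient descent
    1 < batch_size < len(xs)  -> 1.3 mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new sample order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                r = w * xs[i] + b - ys[i]  # residual of sample i
                gw += 2.0 * r * xs[i]
                gb += 2.0 * r
            # one parameter update per batch, using the batch-averaged gradient
            w -= lr * gw / len(batch)
            b -= lr * gb / len(batch)
    return w, b


if __name__ == "__main__":
    xs, ys = make_data()
    for name, bs in [("batch", len(xs)), ("stochastic", 1), ("mini-batch", 20)]:
        w, b = fit(xs, ys, bs)
        print(f"{name:>10}: w = {w:.4f}, b = {b:.4f}")
```

All three recover roughly the same line here; what changes is the number of updates per pass over the data and how noisy each update is.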

For DL there are some additional challenges, because you can get into trouble when you start tweaking the many layers simultaneously. These challenges are addressed with topology tricks (e.g. skip connections), special activation functions (like SELU) and other techniques like batch normalization. Note that this is only an issue for DL. Shallow NN don't have these issues.

Learning (rate) tweaking is not a way to solve these DL problems; the above-mentioned methods, however, work nicely. Below, e.g., is the impact of skip-connection topological changes on the loss function of a 56-layer NN.
Last edited by outrun on January 6th, 2018, 1:48 pm

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

But don't draw general conclusions from this plot. It's a 2d projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the topology of the network and the activation function you choose. Those are the relevant things to focus on.
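One common way to make "flat plateau versus thin hole" precise, offered here as an assumption rather than something stated in the post, is a second-order expansion of the loss $L$ around a local minimum $\theta^*$ (where the gradient vanishes), with $\delta$ a small change in the parameters:

$$L(\theta^* + \delta) \approx L(\theta^*) + \tfrac{1}{2}\, \delta^\top H \delta, \qquad H = \nabla^2 L(\theta^*).$$

If the eigenvalues of $H$ are small, the loss barely moves for a given $\delta$ (a plateau); if they are large, the same $\delta$ costs a lot (a thin hole).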

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

These posts, although related, don't answer the question. It is very simple: GD == SD.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: exp(5) = $e^5$

GD > SD

Edit: GD has many variants; SD is one specific version and, as it turns out, is never used in NN.
In NN people almost always use SGD, which is a different variant, as described above in 1.3.

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

outrun wrote:
GD > SD

Edit: GD has many variants, SD is one specific version,

The literature contradicts all this. There are lots of ways to choose $\lambda$ in SD. See Wikipedia.

In NN people almost always use SGD
SGD is an extension/optimisation of GD.
Last edited by Cuchulainn on January 6th, 2018, 3:25 pm

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: exp(5) = $e^5$

Cuchulainn wrote:
outrun wrote:
GD > SD

Edit: GD has many variants, SD is one specific version,

The literature uses inconsistent terminology or, more correctly, the literature reuses terminology within a scoped namespace.

There's a very large family of methods that numerically compute gradients and numerically manage traversal of a search space based on computed gradients. Some are deterministic, some are stochastic, some are myopic, some attempt to break out of local extrema, some use fixed parameters, some use user-provided parameters, some have meta-algorithms to optimize the parameters, etc.

You used one variant for e^5, deep learning uses other variants, and some types of machine learning don't use gradient methods at all.

Cuchulainn
Posts: 55999
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam

### Re: exp(5) = $e^5$

You used one variant for e^5, deep learning uses other variants, and some types of machine learning don't use gradient methods at all.

Variant of what? I used Steepest Descent with Brent for $\lambda$. I could have used at least 5 other variants. For later.

inconsistent terminology
You're not wrong.
Last edited by Cuchulainn on January 6th, 2018, 3:55 pm