- Cuchulainn
**Posts:**62386**Joined:****Location:**Amsterdam-
**Contact:**

The engineer says: "My equations are a model of the universe."

The physicist says: "The universe is a model of my equations."

The mathematician says: "I don't care."

Physicists have bouts of anxiety.

Cuchulainn: "Physicists have bouts of anxiety."

It is known as Heisenberg's uncertainty principle. I am sure it has an upper limit, so you can stop being so anxious that your electrons could be outside the visible universe at this moment in time.

- Traden4Alpha
**Posts:**23951**Joined:**

It's worse than Heisenberg! When it comes to getting tenure, a physicist knows neither their position nor their momentum!

- Cuchulainn

[$]\pi^5[$] is the area of a circle of radius [$] \pi^2[$]. Probably a useless result.

More generally, [$]f(x) = \pi^x[$] satisfies the ODE

[$]df/dx = a f, \; f(0) = 1[$] where [$]a = \log(\pi)[$].

I would solve it in multiprecision with NDSolve; that may be better than raw multiplication.

Is [$]\pi^e < e^\pi[$], or what?
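As a rough check of the ODE route, here is a minimal sketch that integrates [$]df/dx = a f, f(0) = 1[$] up to [$]x = 5[$] and compares the result with a direct evaluation of [$]\pi^5[$]. It assumes a hand-written classical RK4 in ordinary double precision (not NDSolve, and not multiprecision); the function name `rk4_exponential` and the step count are illustrative choices, not anything from the posts above.

```python
import math

def rk4_exponential(a, x_end, steps):
    """Integrate df/dx = a*f with f(0) = 1 from x = 0 to x_end using classical RK4."""
    h = x_end / steps
    f = 1.0
    for _ in range(steps):
        k1 = a * f
        k2 = a * (f + 0.5 * h * k1)
        k3 = a * (f + 0.5 * h * k2)
        k4 = a * (f + h * k3)
        f += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return f

a = math.log(math.pi)                    # a = log(pi), so f(x) = pi**x
pi_5_ode = rk4_exponential(a, 5.0, 1000)
pi_5_direct = math.pi ** 5
rel_err = abs(pi_5_ode - pi_5_direct) / pi_5_direct

# The question at the end of the post: is pi^e < e^pi?
pi_to_e_is_smaller = math.pi ** math.e < math.e ** math.pi

print(pi_5_ode, pi_5_direct, rel_err, pi_to_e_is_smaller)
```

In double precision the direct power is of course the cheaper route; the ODE form only becomes interesting if the solver can carry more digits than the multiplication.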

- Cuchulainn

```
*TOL and[min, max] for Brent: 1e-05, [-3, 3]
* Error and number of iterations needed : 9.76004e-06, 127
(1, 16.1407)(2, 11.9878)(3, 9.50992)(4, 7.79936)(5, 6.52887)(6, 5.54231)(7, 4.75291)(8, 4.10753)(9, 3.57131)
(10, 3.12019)(11, 2.73685)(12, 2.40849)(13, 2.12536)(14, 1.87987)(15, 1.66604)(16, 1.47904)(17, 1.31497)
(18, 1.17059)(19, 1.04322)(20, 0.930608)(21, 0.830865)(22, 0.742369)(23, 0.663738)(24, 0.593783)(25, 0.531476)
(26, 0.475926)(27, 0.426356)(28, 0.382087)(29, 0.342526)(30, 0.307149)(31, 0.275497)(32, 0.247163)(33, 0.221789)
(34, 0.199056)(35, 0.178683)(36, 0.160418)(37, 0.14404)(38, 0.129348)(39, 0.116168)(40, 0.10434)(41, 0.0937248)
(42, 0.0841958)(43, 0.0756407)(44, 0.0679591)(45, 0.0610609)(46, 0.0548657)(47, 0.0493011)(48, 0.0443027)
(49, 0.0398125)(50, 0.0357786)(51, 0.0321542)(52, 0.0288978)(53, 0.0259718)(54, 0.0233425)(55, 0.0209798)
(56, 0.0188566)(57, 0.0169485)(58, 0.0152337)(59, 0.0136925)(60, 0.0123075)(61, 0.0110626)(62, 0.00994371)
(63, 0.00893807)(64, 0.0080342)(65, 0.00722178)(66, 0.00649154)(67, 0.00583518)(68, 0.0052452)(69, 0.0047149)
(70, 0.00423823)(71, 0.00380976)(72, 0.00342461)(73, 0.00307842)(74, 0.00276722)(75, 0.00248749)(76, 0.00223604)
(77, 0.00201002)(78, 0.00180684)(79, 0.0016242)(80, 0.00146003)(81, 0.00131245)(82, 0.00117979)(83, 0.00106054)
(84, 0.000953348)(85, 0.000856988)(86, 0.000770368)(87, 0.000692504)(88, 0.00062251)(89, 0.000559591)
(90, 0.000503031)(91, 0.000452189)(92, 0.000406485)(93, 0.000365401)(94, 0.000328469)(95, 0.000295271)(96, 0.000265427)
(97, 0.0002386)(98, 0.000214485)(99, 0.000192807)(100, 0.00017332)(101, 0.000155802)(102, 0.000140055)(103, 0.0001259)
(104, 0.000113175)(105, 0.000101736)(106, 9.1454e-05)(107, 8.22108e-05)(108, 7.39017e-05)(109, 6.64325e-05)
(110, 5.97182e-05)(111, 5.36825e-05)(112, 4.82568e-05)(113, 4.33796e-05)(114, 3.89952e-05)(115, 3.5054e-05)
(116, 3.15111e-05)(117, 2.83263e-05)(118, 2.54634e-05)(119, 2.28898e-05)(120, 2.05763e-05)(121, 1.84967e-05)
(122, 1.66273e-05)(123, 1.49468e-05)(124, 1.34361e-05)(125, 1.20781e-05)(126, 1.08574e-05)(127, 9.76004e-06)
Initial guess : [1](50.6)
Final result : [1](148.413)
Function value : 3.42116e-13
```

Here is the output for [$](\log y - 5)^2[$]. Notice that it is impervious to the pesky learning rate because I use Brent instead.

SD computes [$]\lambda[$] using a 1d solver such as Brent, Golden Mean, Fibonacci etc., and a local minimum is good enough. With GD you guess a [$]\lambda[$] and see how it goes.
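The difference between the two can be sketched on the same function, [$](\log y - 5)^2[$], starting from the initial guess 50.6 shown in the output. This is only an illustration under assumptions: the line search is a plain golden-section search rather than Brent, the bracket `[0, lam_max]` and the fixed step `lam=1000.0` are values picked for this sketch, and all function names are made up here. It is not the code that produced the output above.

```python
import math

def f(y):
    return (math.log(y) - 5.0) ** 2

def grad(y):
    return 2.0 * (math.log(y) - 5.0) / y

def golden_section(phi, lo, hi, tol=1e-6):
    """Minimise a unimodal 1-D function phi on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - r * (hi - lo)
    d = lo + r * (hi - lo)
    while hi - lo > tol:
        if phi(c) < phi(d):
            hi = d
        else:
            lo = c
        c = hi - r * (hi - lo)
        d = lo + r * (hi - lo)
    return 0.5 * (lo + hi)

def steepest_descent(y0, lam_max=1e4, tol=1e-5, max_iter=200):
    """SD: at every iteration lambda comes from a 1-D line search."""
    y = y0
    for k in range(1, max_iter + 1):
        g = grad(y)

        def phi(lam):
            trial = y - lam * g
            return f(trial) if trial > 0 else math.inf  # log needs y > 0

        lam = golden_section(phi, 0.0, lam_max)
        y_new = y - lam * g
        if abs(y_new - y) < tol:
            return y_new, k
        y = y_new
    return y, max_iter

def gradient_descent(y0, lam, tol=1e-5, max_iter=10000):
    """GD: lambda is guessed once and kept fixed."""
    y = y0
    for k in range(1, max_iter + 1):
        y_new = y - lam * grad(y)
        if abs(y_new - y) < tol:
            return y_new, k
        y = y_new
    return y, max_iter

y_sd, n_sd = steepest_descent(50.6)
y_gd, n_gd = gradient_descent(50.6, lam=1000.0)
print("SD:", y_sd, n_sd)
print("GD:", y_gd, n_gd)
print("exp(5):", math.exp(5.0))
```

In one dimension a wide enough bracket lets the line search land on the minimiser almost immediately, so the comparison mainly shows how much the fixed-step variant depends on the guessed [$]\lambda[$].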

No, again:

Steepest Descent != Gradient Descent

Gradient Descent != AI

The subfield of *Machine Learning* called *supervised learning in Neural Networks* uses variants of *stochastic* gradient descent, which is *very* different. You should know that before talking about it.

- Cuchulainn

Thank you for the history lesson. But wait a minute, all the media call it AI. So, I will stop. Mind you, Goodfellow in his book calls it ...

Correction: GD is used in ML/DL. Better?

- Cuchulainn

This doesn't tell me much. In which sense not? BTW, you are wrong. Read this.

From what I have read, they are variants of the more general gradient method:

https://en.wikipedia.org/wiki/Gradient_method

You want good use of terminology, no?

ML is a collection of statistical methods in which a program solves problems by learning from data using generic learning methods, as opposed to postulating a model or using expert opinion and then coding it up. In the past, engineers spent lots of time postulating and implementing filters to manipulate data based on personal views and experience. Think e.g. of speech recognition or speech synthesis in the 80s (Hawking's voice).

Some popular methods in ML are random forests, support vector machines, clustering algorithms, kNN and NN. Some of these are called supervised learning, others unsupervised. None of these use derivatives of a loss function except the last.

DL is neural networks with more than a couple of layers. The most common supervised learning method with NN is backpropagation. It uses derivatives to learn, and there are two main elements:

1) Which samples do you use to compute an update? There are 3 variants of how to update your model parameters, based on different ways of selecting the sample data used for training. 1.1 Batch gradient descent uses all data samples. 1.2 Stochastic gradient descent uses one data point for each update. 1.3 Mini-batch gradient descent uses a subset of samples for each update. In NN people always use 1.3; in finance (e.g. calibration) people typically use 1.1.

2) Once gradients are computed, based on a set of samples and the computed loss, there are various methods for deciding how much to update the model parameters. Currently the most popular choice (but it's not as clear-cut as the previous choice) is ADAM, which is a momentum method. There are many others, each with their own hyperparameters. Some, e.g., have a learning rate, but others do not.
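The three sampling variants in 1) differ only in how many samples feed each update, which a toy example can make concrete. The sketch below is an assumption-laden illustration: a one-parameter linear model `y = w*x` fitted to synthetic data with true slope 3, a plain fixed learning rate as the update rule in 2) (not ADAM), and hypothetical helper names `make_data`, `grad_mse` and `train`.

```python
import random

def make_data(n, seed=0):
    """Synthetic data y = 3*x + small noise (toy problem for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.01) for x in xs]
    return xs, ys

def grad_mse(w, xs, ys, idx):
    """Gradient of the mean squared error of y = w*x over the samples in idx."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def train(xs, ys, batch_size, lr=0.5, epochs=50, seed=1):
    """Fixed-learning-rate descent; batch_size selects the variant."""
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)                      # new sample order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad_mse(w, xs, ys, idx)  # one update per batch
    return w

xs, ys = make_data(200)
w_batch = train(xs, ys, batch_size=200)  # 1.1: all samples per update
w_sgd = train(xs, ys, batch_size=1)      # 1.2: one sample per update
w_mini = train(xs, ys, batch_size=20)    # 1.3: a subset per update
print(w_batch, w_sgd, w_mini)
```

With the same number of epochs, the batch variant makes 50 updates, the mini-batch variant 500 and the stochastic variant 10000, which is the trade-off between noisier gradients and more frequent updates.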

For DL there are some additional challenges because you can get into trouble when you start tweaking the many layers simultaneously. These challenges are addressed with topology tricks (e.g. skip connections), special activation functions (like SELU) and other techniques like batch normalization. Note that this is only an issue for DL. Shallow NNs don't have these issues.

Learning (rate) tweaking is not a way to solve these DL problems; the above-mentioned methods, however, work nicely. Below, e.g., is the impact of skip connections (a topological change) on the loss function of a 56-layer NN.

Last edited by outrun on January 6th, 2018, 1:48 pm, edited 1 time in total.

But don't draw general conclusions from this plot. It's a 2d projection of a million-dimensional space; both might still be smooth in high dimensions.

Also, when looking for a local minimum you would preferably want to land on a flat plateau, not in a thin hole. The reason is that a flat plateau means that the performance of the model is somewhat insensitive to changes in parameters (moving around the plateau), which is a form of generalization. Whether there are plateaus or not depends on the type of problem, but also on the network topology and the activation function you choose. Those are the relevant things to focus on.
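The plateau argument can be put in numbers with a deliberately crude one-parameter sketch. It assumes a quadratic loss [$]\frac{1}{2} c\,\theta^2[$] around the minimum, where the curvature values 0.01 (flat) and 100 (thin hole) and the perturbation 0.5 are arbitrary choices for illustration, not properties of any real network.

```python
def loss_increase(curvature, delta):
    """Rise in the loss 0.5*curvature*theta**2 when theta moves delta away from the minimum at 0."""
    return 0.5 * curvature * delta ** 2

delta = 0.5                                        # same parameter perturbation in both cases
flat = loss_increase(curvature=0.01, delta=delta)   # flat plateau
sharp = loss_increase(curvature=100.0, delta=delta) # thin hole
print(flat, sharp, sharp / flat)
```

The same shift in the parameter costs far more in the thin hole than on the plateau, which is the insensitivity the post refers to.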

- Cuchulainn

These posts - although related - don't answer the question. It is very simple: GD == SD?

GD > SD

Edit: GD has many variants; SD is one specific version and, as it turns out, is never used in NN.

In NN people almost always use SGD, which is a different variant, as described above in 1.3.

- Cuchulainn

The literature contradicts all this. There are lots of ways to choose [$]\lambda[$] in SD. See Wikipedia.

SGD is an extension/optimisation of GD.

Last edited by Cuchulainn on January 6th, 2018, 3:25 pm, edited 1 time in total.

- Traden4Alpha

The literature uses inconsistent terminology or, more correctly, the literature reuses terminology within a scoped namespace.

There's a very large family of methods that numerically compute gradients and numerically manage traversal of a search space based on computed gradients. Some are deterministic, some are stochastic, some are myopic, some attempt to break out of local extrema, some use fixed parameters, some use user-provided parameters, some have meta-algorithms to optimize the parameters, etc.

You used one variant for [$]e^5[$], deep learning uses other variants, and some types of machine learning don't use gradient methods at all.

- Cuchulainn

Variant of what? I used Steepest Descent with Brent for [$]\lambda[$]. I could have used at least 5 other variants. For later.

You're not wrong.

Last edited by Cuchulainn on January 6th, 2018, 3:55 pm, edited 2 times in total.
