- katastrofa
**Posts:** 8732 **Location:** Alpha Centauri

Quantum annealers (e.g. D-Wave) find global minima. What deep networks may do, unlike classical algorithms, is find correlations between the positions of minima (something more complex than e.g. momentum methods) - that's a quantum algorithm's capability. Introducing memory might possibly help too...

- Cuchulainn
**Posts:** 61085 **Location:** Amsterdam

Just one thing... NNs use discrete data structures. Here and there you see ODEs being used.

Thoughts?
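The ODE connection can be made concrete: a residual update x_{k+1} = x_k + h * f(x_k) is exactly one explicit Euler step of dx/dt = f(x) (the "neural ODE" view). A minimal sketch, where the vector field f(x) = -x is a toy assumption rather than a trained network:

```python
import math

# A residual update x_{k+1} = x_k + h * f(x_k) is exactly one explicit
# Euler step of the ODE dx/dt = f(x). Toy vector field: f(x) = -x,
# whose exact solution is x(t) = x0 * exp(-t).

def f(x):
    return -x

def euler_resnet(x0, h, steps):
    """Apply `steps` residual blocks, i.e. explicit Euler steps."""
    x = x0
    for _ in range(steps):
        x = x + h * f(x)   # one "residual layer"
    return x

x0, t_end, steps = 1.0, 1.0, 1000
approx = euler_resnet(x0, t_end / steps, steps)
exact = x0 * math.exp(-t_end)
print(abs(approx - exact) < 1e-3)  # True: the residual stack tracks the ODE
```

The point of the sketch: a deep stack of residual layers with small h behaves like a continuous-time dynamical system, which is where the ODE viewpoint enters.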

*Another source of continuous-nonlinear RNNs arose through a study of adaptive behavior in real time, which led to the derivation of neural networks that form the foundation of most current biological neural network research (Grossberg, 1967, 1968b, 1968c). These laws were discovered in 1957-58 when Grossberg, then a college Freshman, introduced the paradigm of **using nonlinear systems of differential equations to model how brain mechanisms can control behavioral functions.** The laws were derived from an analysis of how psychological data about human and animal learning can arise in an individual learner adapting autonomously in real time. Apart from the Rockefeller Institute student monograph Grossberg (1964), it took a decade to get them published.*

I feel less nervous with ODEs (more robust) than with Hessians and gradients. But maybe they are unavoidable?

*and ?*

*The **counterpropagation network** is a hybrid network. It consists of an outstar network and a competitive filter network. It was developed in 1986 by Robert Hecht-Nielsen. It is guaranteed to find the correct weights, unlike regular backpropagation networks, which can **become trapped in local minima during training**.*

This sounds reasonable but experts' opinions would be welcome.

http://www.datasimfinancial.com

http://www.datasim.nl

Every Time We Teach a Child Something, We Keep Him from Inventing It Himself

Jean Piaget

I had that book in the '90s. The Kohonen network is a one-hot encoder, which is very inefficient. At the time I worked on extending it to an m-over-n code and a full 2^n code, using a kernel trick to cast points into high dimensions. Learning was, however, very unstable.
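For readers who haven't met one, the winner-take-all ("one-hot") behaviour of a Kohonen map can be sketched in a few lines. This toy 1-D version is illustrative only and does not implement the m-over-n extension described above:

```python
import random

# Minimal 1-D Kohonen / self-organising map sketch (illustrative only).
# The "one-hot" aspect: each input activates exactly one winning unit.

random.seed(0)

n_units = 4
weights = [random.random() for _ in range(n_units)]  # 1-D codebook

def winner(x):
    """Index of the unit closest to input x (the single 'hot' unit)."""
    return min(range(n_units), key=lambda i: abs(weights[i] - x))

def train(samples, lr=0.2, epochs=50):
    for _ in range(epochs):
        for x in samples:
            w = winner(x)
            weights[w] += lr * (x - weights[w])  # move winner toward input

# Two clusters of data; after training, distinct units win for each.
data = [0.1, 0.12, 0.9, 0.88]
train(data)
print(winner(0.1) != winner(0.9))  # True: the clusters get different codes
```

The inefficiency mentioned above is visible here: with n units the code can only distinguish n regions, so the representational capacity grows linearly in the number of units rather than combinatorially.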

If we have 100 samples drawn from some distribution, and some flexible function, how would you fit that function so that it represents the distribution the samples came from? That's the basic statistical view you first need to understand before you can think about local minima (which are not relevant in practice).

What would, e.g., be the best fit in your opinion? And why would you want to fit anything at all - why not just keep a lookup table of your samples?
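One possible answer, sketched under the simplest assumption (fit a single Gaussian by maximum likelihood): the fitted density assigns probability to unseen points, while a lookup table cannot. The data and the Gaussian model here are toy choices for illustration:

```python
import math
import random

random.seed(1)

# 100 samples from N(0, 1); fit a Gaussian by maximum likelihood,
# i.e. the sample mean and variance -- the simplest "flexible function".
samples = [random.gauss(0.0, 1.0) for _ in range(100)]

mu = sum(samples) / len(samples)
var = sum((x - mu) ** 2 for x in samples) / len(samples)

def density(x):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The fitted density assigns mass to points never seen in the sample;
# a lookup table of the 100 samples has no entry for them.
x_new = 0.123456
print(density(x_new) > 0)   # True: generalises beyond the table
print(x_new in samples)     # False: the lookup table has no such entry
```

This is the sense in which fitting "represents the distribution": it interpolates and extrapolates, which is exactly what a lookup table of raw samples cannot do.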

- Cuchulainn
**Posts:** 61085 **Location:** Amsterdam

What's wrong with a global minimum? Or is this some defence of the fact that gradient descent only finds local minima?

I have asked this question about 5 times already, but no answer to date. The NN literature suggests local minima are sub-optimal.

What is this saying? That local minima are bad?
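As a concrete illustration of "local minima found by gradient descent can be sub-optimal", here is a toy double-well; the function and step size are arbitrary choices for the sketch:

```python
# Toy illustration: plain gradient descent on the double-well
# f(x) = (x^2 - 1)^2 + 0.3*x, which has a local minimum near x = +0.96
# and a lower (global) minimum near x = -1.04. Which one gradient
# descent finds depends entirely on the starting point.

def f(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_right = descend(1.2)    # starts in the right-hand basin
x_left = descend(-1.2)    # starts in the left-hand basin
print(f(x_right) > f(x_left))  # True: the right basin is only a local minimum
```

Whether this matters for NNs is exactly the open question in the thread: in one dimension the trap is obvious, but the high-dimensional picture (discussed below) is quite different.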

- katastrofa
**Posts:** 8732 **Location:** Alpha Centauri

What's the dimension of a typical NN, ISayMoo? I presume it's high. Ergo, how many minima of either kind can we expect to find there? Another question of mine seems similar to Cuchulainn's: in model selection (which is in a way what NNs do), the "best" model is bad, because it always promotes overfitting. Don't we have the same problem with a global minimum?

- Cuchulainn
**Posts:** 61085 **Location:** Amsterdam

For example, the Eggholder function? Not to mention yuge gradients.
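For reference, here is the Eggholder function itself; the domain [-512, 512]^2 and the known global minimum are standard benchmark values:

```python
import math

# The Eggholder test function: highly multimodal with steep gradients,
# a standard stress test for global optimisers. Usually evaluated on
# [-512, 512]^2; known global minimum f(512, 404.2319) ~ -959.6407.

def eggholder(x, y):
    a = -(y + 47.0) * math.sin(math.sqrt(abs(x / 2.0 + y + 47.0)))
    b = -x * math.sin(math.sqrt(abs(x - (y + 47.0))))
    return a + b

print(round(eggholder(512.0, 404.2319), 2))  # -959.64
```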

- katastrofa
**Posts:** 8732 **Location:** Alpha Centauri

I would expect that there are very few minima because of the high dimension, but maybe my intuition is completely wrong...

- Cuchulainn
**Posts:** 61085 **Location:** Amsterdam

*I would expect that there are very few minima because of the high dimension, but maybe my intuition is completely wrong...*

It's got lots of minima!

BTW my DE algo fails for the Griewank function

https://www.sfu.ca/~ssurjano/griewank.html
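For concreteness, here is the Griewank function together with a bare-bones DE/rand/1/bin loop in plain Python. All parameter choices (population 30, F = 0.7, CR = 0.9) are generic illustrative defaults, not a claim about the DE algo mentioned above:

```python
import math
import random

random.seed(42)

# Griewank function: f(0) = 0 is the global minimum, surrounded by a
# huge number of regularly spaced local minima (the cosine product).
def griewank(x):
    s = sum(v * v for v in x) / 4000.0
    p = 1.0
    for i, v in enumerate(x):
        p *= math.cos(v / math.sqrt(i + 1))
    return 1.0 + s - p

# Bare-bones differential evolution, DE/rand/1/bin (illustrative, untuned).
def differential_evolution(f, dim=2, bounds=(-50.0, 50.0),
                           pop_size=30, F=0.7, CR=0.9, generations=200):
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    cost = [f(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
            trial = []
            jrand = random.randrange(dim)
            for j in range(dim):
                if random.random() < CR or j == jrand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    trial.append(min(max(v, lo), hi))  # clip to bounds
                else:
                    trial.append(pop[i][j])
            fc = f(trial)
            if fc < cost[i]:          # greedy selection: never gets worse
                pop[i], cost[i] = trial, fc
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]

x_best, f_best = differential_evolution(griewank)
print(f_best)   # should be close to 0 if DE escapes the local minima
```

The regularly spaced cosine "bumps" are exactly what makes Griewank awkward for population methods: with too small a population or search range, every individual can end up in the same sub-optimal bump.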

*I would expect that there are very few minima because of the high dimension, but maybe my intuition is completely wrong...*

I have the same feeling; the chance of having zero derivatives in all directions in high dimensions is nearly zero. There is also recent research on this (and older work in physics): https://stats.stackexchange.com/questio ... lue-to-the
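The intuition can be made quantitative with a toy model: if the d eigenvalues of the Hessian at a critical point had independent random signs, then a local minimum (all d positive) would occur with probability 2^-d. The independence assumption is of course a caricature of real loss surfaces, but the scaling is the point:

```python
import random

random.seed(0)

# Toy version of the "high-dimensional critical points are almost all
# saddles" argument: model the d Hessian eigenvalues at a critical
# point as independent random signs. A local minimum needs ALL d of
# them positive, which happens with probability 2**-d.

def frac_minima(d, trials=200_000):
    """Simulated fraction of critical points that are minima under the
    independent-sign toy model."""
    hits = sum(
        all(random.random() < 0.5 for _ in range(d))
        for _ in range(trials)
    )
    return hits / trials

for d in (1, 2, 5, 10):
    print(d, 0.5 ** d, frac_minima(d))
# Already at d = 10, about 0.1% of critical points are minima; for a
# million-parameter NN the toy probability is astronomically small.
```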

The most widely used method is *stochastic* gradient descent, not full-batch gradient descent. A typical NN is between 10k and 10 million dimensional.
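A minimal illustration of the distinction, on a toy 1-D least-squares model (nothing here is specific to NNs): each update uses the gradient at one randomly drawn sample rather than the full data set.

```python
import random

random.seed(3)

# Minimal stochastic-gradient sketch: fit y = w*x by least squares,
# updating on ONE random sample per step instead of the full batch.
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ys = [2.0 * x + random.gauss(0.0, 0.05) for x in xs]  # true slope = 2

w = 0.0
lr = 0.1
for step in range(5000):
    i = random.randrange(len(xs))              # draw a single data point
    g = 2.0 * (w * xs[i] - ys[i]) * xs[i]      # gradient of (w*x - y)^2
    w -= lr * g

print(w)  # hovers near the true slope 2.0 (noisy, since updates are stochastic)
```

The noise in the per-sample gradient is precisely why SGD's behaviour around saddle points and shallow minima differs from that of full-batch descent.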

If you fit a mixture density to a small set of points, then the global optimum (of maximum likelihood) will be a set of Dirac delta functions; it is unlikely to perform well when evaluated on new samples!

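A sketch of why the degenerate optimum wins, using a single Gaussian component centred on one sample (toy numbers, nothing fitted):

```python
import math

# Why the *global* maximum-likelihood solution for a mixture is
# degenerate: put one Gaussian component exactly on a data point and
# shrink its variance -- the likelihood grows without bound, because
# the component tends to a Dirac delta at that sample.

samples = [0.0, 1.1, 2.3, 3.7]   # tiny illustrative data set

def log_density(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - (x - mu) ** 2 / (2 * sigma ** 2)

# A component centred exactly on samples[0]: its contribution to the
# log-likelihood at that point explodes as sigma -> 0.
for sigma in (1.0, 0.1, 0.01, 0.001):
    print(sigma, log_density(samples[0], samples[0], sigma))
# The printed log-densities increase without bound as sigma shrinks.
```

So the global maximum of the likelihood is exactly the overfitted solution, which is the sense in which a "merely local" optimum can be the useful one here.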
- Cuchulainn
**Posts:** 61085 **Location:** Amsterdam

*I have the same feeling; the chance of having zero derivatives in all directions in high dimensions is nearly zero. There is also recent research on this (and older work in physics): https://stats.stackexchange.com/questio ... lue-to-the*

Talk is cheap. Prove it.

Last edited by Cuchulainn on November 13th, 2017, 8:39 pm, edited 1 time in total.

*Quantum annealers (e.g. D-Wave) find global minima. What deep networks may do, unlike classical algorithms, is find correlations between the positions of minima (something more complex than e.g. momentum methods) - that's a quantum algorithm's capability. Introducing memory might possibly help too...*

Some people say it does:

"Our analysis suggests that the convergence issues may be fixed by endowing such algorithms with “long-term memory” of past gradients"

(The criticism I heard of their fix is that it fixes convergence in some cases, but makes it worse in others.)
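One published fix along these lines (AMSGrad) amounts to a single extra line in the Adam-style update: keep the running maximum of the second-moment estimate, so the effective step size can never grow back. A hedged 1-D sketch (bias correction omitted for brevity; the quadratic objective is just a toy):

```python
import math

# Sketch of a "long-term memory" fix (AMSGrad-style): keep the running
# MAXIMUM of the second-moment estimate, unlike plain Adam, which lets
# it decay. Illustrative 1-D implementation, bias correction omitted.

def amsgrad(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999,
            eps=1e-8, steps=2000):
    m = v = v_hat = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        v_hat = max(v_hat, v)                  # long-term memory of v
        x -= lr * m / (math.sqrt(v_hat) + eps)
    return x

# Minimise f(x) = (x - 3)^2, gradient 2*(x - 3); minimum at x = 3.
x_opt = amsgrad(lambda x: 2.0 * (x - 3.0), x=0.0)
print(x_opt)  # should settle near 3.0
```

The `max` is the whole fix: plain Adam's exponentially decaying second moment can "forget" large past gradients and re-inflate the step size, which is what breaks its convergence proof.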

- Cuchulainn
**Posts:** 61085 **Location:** Amsterdam

*Some people say it does:*

*"Our analysis suggests that the convergence issues may be fixed by endowing such algorithms with “long-term memory” of past gradients"*

*(The criticism I heard of their fix is that it fixes convergence in some cases, but makes it worse in others.)*

Ergo, the method is not robust.

Rule in mathematics: if you make an algo easy in one respect, it will become difficult somewhere else. Essential difficulties remain.

Nothing wrong with throwing the kitchen sink at the problem, but it has to be more than fixes if you want to go into one of them robot cars.

- Traden4Alpha
**Posts:** 23951

*tl;dr: we don't understand how deep networks learn*

*Ah, but does it matter? If the goal is science, then "yes". If the goal is practical solutions, then "no".*

*It matters a lot! If you don't understand the mathematical foundations, you're groping around in the dark and progress is very slow. People who want to push AI forward are very keen on understanding the mathematical foundations of NN learning, because only then will we be able to create systems which can "learn to learn", and create AGI (artificial general intelligence).*

We don't have true mathematical foundations in any science, only a best-fit-so-far set of theories expressed in math. Admittedly, those best-fit-so-far maths are amazingly accurate. But are they correct? The entire history of science is just a series of discarded "foundations", with no guarantee that today's math is correct.

At best, the only foundation in science is the growing set of observations and experimental outcomes that must be explained by whatever theories du jour are bubbling up. The monotonic growth in data seems to suggest there's monotonic growth in confidence in the math du jour, but there's no guarantee that some observation tomorrow won't totally destroy today's foundation (except as a convenient approximation, like F = ma).

*Humans seem perfectly comfortable using their brains despite having no clue how they work.*

*We do have some clues, and I wouldn't say that we are "perfectly comfortable" with the current state of knowledge about how our (and other animals') brains work - we don't know how to treat depression and other mental disorders, we don't know how to optimally teach and train people, etc.*

We know less about brains than we know about neural nets, which isn't surprising in that artificial neural nets are modeled on natural neural nets. Certainly there's no solid mathematical foundation for the brain. We don't know why we must sleep (which seems like an extremely maladaptive thing to do). We don't know how anesthetics and painkillers actually work. And what mathematical foundation predicts the placebo effect? And what the hell are all those glial cells and astrocytes doing?
