Did you guys discuss this here already?

https://arxiv.org/pdf/1706.04702.pdf


- Cuchulainn
**Posts:** 59684 **Location:** Amsterdam

We know nothing about this. It's kind of pure maths/university research.

Good compelling examples missing.


Many people try such things now. What I find a bit weird in their implementation is that they seem to have a recurrent structure in the solver, but do not make use of this fact in the code. They just stack layer after layer, one for each time step. It won't scale.
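To put rough numbers on the scaling concern, here is a small sketch comparing the parameter count when every time step gets its own layer with the count when a single layer is reused across steps. The layer width and step counts are hypothetical values picked for illustration, not figures from the paper, and the sketch says nothing about whether weight sharing would actually suit this particular solver.

```python
def dense_params(width):
    """Weights plus biases of one fully connected width-by-width layer."""
    return width * width + width


def stacked_params(width, n_steps):
    """A separate layer per time step: the count grows linearly with n_steps."""
    return n_steps * dense_params(width)


def shared_params(width, n_steps):
    """One layer reused at every time step (recurrent style): the count is constant."""
    return dense_params(width)


width = 100  # hypothetical layer width
for n_steps in (10, 100, 1000):
    print(n_steps, stacked_params(width, n_steps), shared_params(width, n_steps))
```

With these assumed sizes the stacked variant grows from about 10^5 to about 10^7 parameters as the number of time steps goes from 10 to 1000, while the shared variant stays at 10,100.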

- Cuchulainn

One point is not to succumb to what happened to String Theory, i.e. unrealistic expectations. AI has a track record but maybe it is better to focus on what it does best (and what it does not do best).

I can't imagine a NN being more efficient and accurate than a standard finite difference method. But you never know.

I am reading Goodfellow et al. One first impression (which is allowed!) is that the underlying maths (chapter 4, for example) is fairly basic and somewhat outdated. But I need to read more of course. Is there no alternative to gradient-based methods here? Gradients are badly-behaved objects. They fly off the handle so easily.


I practically created this thread for you.

I don't think that it is known what it does best and what not. I could imagine that for very high-dimensional problems the universal function approximation properties of neural nets could well be useful in various areas outside of computer vision.

Your call on the math being outdated and fairly basic is missing the point a bit, I think. The main problem with this forum, as I have been complaining for years (if you read some of the threads I contributed to), is that it is filled with people who are essentially hung up irreversibly on parametric methods. It makes things look exceedingly old-fashioned. It's pretty good that some people have now taken an interest in non-parametric methods. This is what machine learning is: it's basically non-parametric statistical modeling, and gradient descent is extremely general, so quite appropriate in such a general setting. The math in this book isn't exactly simplistic btw.

My sense is that ML in finance is already past its prime. Success on the buy-side hasn't been spectacular. Banks haven't even started failing yet. I think things swung from extremely parametric models over into extremely non-parametric models. The truth lies hopefully somewhere in between.


- Cuchulainn

I hope others can feel it is for them as well. I don't have a case I worked out myself but that should not stop our asking questions such as:

1. Why not use Differential Evolution as well as Gradient Descent (saying it's slow is disingenuous)

2. I would like C++ as well as Python. Is Python slow?

3. Is DL for more than computer vision?

4. A 101 example A-Z just to show how it works.

5. I had some exposure to topology in previous life: somehow TDA feels more robust than universal approximators.

These are genuine questions.


1. NNs are typically functions of 10k to 10 million dimensions. DE would be very slow: you would need at least twice as many agents as there are dimensions, and in practice many more, or else you would be searching in a subspace. Each agent represents an instance of the network, so memory consumption and computation are both large.

NNs have lots of paths to minima; for example, they have many permutation symmetries. You also don't want to find the global minimum because you would be overfitting. The most common method is *stochastic* gradient descent because the objective function is not smooth and convex.

2. All the popular frameworks have bindings to popular languages like Python and C++. The backends are BLAS and CUDA libraries, which is where 99% of the executing code will be.

3. Yes, anything where finding representations of data matters is a good candidate.

4. You have a good book; start coding and experimenting. Andrew Ng's course is a really good intro.

5. Imo the challenges are in the loss functions and the topologies. People look at NNs as networks with valves and intersections where information flows, and think in terms of gradient management. Unsupervised learning is the most exciting area at the moment.
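The memory argument in point 1 can be turned into back-of-the-envelope arithmetic. The sketch below assumes my reading of the post (a population of twice the number of dimensions, each agent holding a full copy of the parameters) and assumes 4-byte floats; the function names are made up for illustration and do not come from any DE library.

```python
def de_population_bytes(n_params, agents_per_dim=2, bytes_per_value=4):
    """Memory for a DE population in which every agent is a full parameter vector."""
    n_agents = agents_per_dim * n_params  # assumed rule: two agents per dimension
    return n_agents * n_params * bytes_per_value


def sgd_bytes(n_params, bytes_per_value=4):
    """Plain SGD keeps one parameter vector and one gradient vector."""
    return 2 * n_params * bytes_per_value


for n_params in (10_000, 10_000_000):
    print(n_params,
          de_population_bytes(n_params) / 1e9, "GB for DE vs",
          sgd_bytes(n_params) / 1e9, "GB for SGD")
```

Under these assumptions the population alone needs 0.8 GB at 10k parameters and 8e14 bytes (800 TB) at 10 million, because the cost grows with the square of the parameter count, whereas SGD grows linearly.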


You can try a simple fully connected NN of, e.g., 3 layers.

Input a vector x of length 256. Each layer does y = max(Ax + b, 0), with A a 256x256 matrix and b a vector of length 256. The y output of layer L is the x input of layer L+1.

So, you have a 256-vector input, then a bunch of layers, and then a 256-vector output. Try to teach it to map some 256-d input points to some 256-d output points. E.g., provide it with 100,000 generated x,y pairs of some function. Maybe sin(tf) fragments as input, and teach it to output cos(5tf).

A first question:

How many variables does this NN have?
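A minimal pure-Python sketch of the network just described, covering the forward pass only (no training). The random initialisation and the helper names are assumptions for illustration; the layer rule y = max(Ax + b, 0) and the sizes are the ones given above. It also counts the trainable variables, which works out to 3 x (256 x 256 + 256).

```python
import random


def make_layer(n_in, n_out, rng):
    """Return (A, b): small random weights (an assumed init) and zero biases."""
    A = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return A, b


def layer_forward(A, b, x):
    """Compute y = max(Ax + b, 0) elementwise."""
    return [max(sum(a * xj for a, xj in zip(row, x)) + bi, 0.0)
            for row, bi in zip(A, b)]


def forward(layers, x):
    """Feed x through all layers; each output is the next layer's input."""
    for A, b in layers:
        x = layer_forward(A, b, x)
    return x


def count_params(layers):
    """Number of trainable variables: all matrix entries plus all biases."""
    return sum(len(A) * len(A[0]) + len(b) for A, b in layers)


rng = random.Random(0)
layers = [make_layer(256, 256, rng) for _ in range(3)]
x = [rng.random() for _ in range(256)]
y = forward(layers, x)
print(len(y), count_params(layers))  # 256 197376
```

A real experiment would use a framework with automatic differentiation; the plain lists here only make the structure and the count explicit.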


> Is there no alternative to gradient-based methods here? Gradients are badly-behaved objects. They fly off the handle so easily.

Geoff Hinton (one of the people who popularized back-propagation) kind of agrees with you: https://www.axios.com/ai-pioneer-advoca ... 37027.html

> I hope others can feel it is for them as well. I don't have a case I worked out myself but that should not stop our asking questions such as:
>
> 1. Why not use Differential Evolution as well as Gradient Descent (saying it's slow is disingenuous)

People use evolutionary algorithms to search for better network architectures, but practically everyone trains a NN using Stochastic Gradient Descent.

> 2. I would like C++ as well as Python. Is Python slow?

It is, that's why TensorFlow is C++ under the hood.

> 3. Is DL for more than computer vision?

Of course! Speech recognition, stochastic control, playing games...

> 4. A 101 example A-Z just to show how it works.

Try this: https://www.tensorflow.org/get_started/mnist/beginners

> 5. I had some exposure to topology in previous life: somehow TDA feels more robust than universal approximators.
>
> These are genuine questions.

And... surprise! Neural Networks are not robust: https://blog.openai.com/adversarial-example-research/

Adversarial examples are a big topic in AI research now.
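For a feel of what an adversarial example is, here is a toy sketch on a linear classifier rather than a neural network. It uses the sign-of-the-gradient idea (as in the fast gradient sign method); the weights, input, and step size are contrived values chosen so the effect is easy to see, not data from any real model.

```python
def score(w, b, x):
    """Linear score; the classifier predicts class 1 when the score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    """Return -1, 0 or 1."""
    return (v > 0) - (v < 0)


def adversarial(w, x, eps):
    """Move every coordinate by eps against the gradient of the score (which is w)."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]


n = 1000
w = [0.01 if i % 2 == 0 else -0.01 for i in range(n)]  # contrived weights
b = 0.0
x = [0.05 * sign(wi) for wi in w]                      # contrived input, score about +0.5
eps = 0.1                                              # per-coordinate budget

x_adv = adversarial(w, x, eps)
print(score(w, b, x) > 0, score(w, b, x_adv) > 0)  # True False
```

No coordinate moves by more than 0.1, yet the score shifts by eps times the sum of |w_i|, which is 1.0 here: many tiny changes add up in high dimension, and that is enough to flip the decision.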

- katastrofa
**Posts:** 8038 **Location:** Alpha Centauri

As much as a lot of ML looks to me like a set of not-so-sophisticated methods of overfitting (a.k.a. "universal representation"), there are some areas in which their efficiency can make a difference, e.g. faster algorithms for solving density functional theory (the fundamental method of modelling all sorts of many-body systems and the basis for developing new materials). Otherwise, the ML researchers seem to either fly away in the same direction as the ST people or appear utterly arrogant by claiming that they developed basic risk-score methods, Voronoi tessellation, etc. (I'm teasing ISayMoo)

@ISayMoo "Adversarial examples are a big topic in AI research now."

That's the thing. A lot of the most urgent real-life problems are about dealing with fat-tail risks, e.g. a chihuahua vs a muffin or a camera vs a missile launcher.


- Cuchulainn

> 1. NNs are typically functions of 10k to 10 million dimensions. DE would be very slow: you would need at least twice as many agents as there are dimensions, and in practice many more, or else you would be searching in a subspace. Each agent represents an instance of the network, so memory consumption and computation are both large.

I don't get this answer, not at all. Are you saying finding a local minimum is OK?

1. In every branch, we desire a global minimum >> local minimum, even in DL (have a look at Goodfellow figure 4.3 page 81).

2. Gradient descent methods find local minima, not necessarily global ones. Is that serious?

3. Overfitting is caused by high-order polynomials (yes?). I don't see what the relationship is with finding minima.

4. More evidence is needed on "how slow" DE is (the good news is that it always gives a global minimum).

5. The Cybenko universal approximation theorem seems to have little coupling to anything in mainstream numerical approximation. Maybe it is not necessary, but maybe say that. Borel measures and numerical accuracy are not a good mix IMO.

Mathematically, it feels that this approach is not even wrong..


Last edited by Cuchulainn on November 8th, 2017, 10:12 am, edited 6 times in total.

- Cuchulainn

This is very general. Reminds me of the 90s with OOT 1) everything is an object, 2) objects are for the plucking (easy to find).

What is needed is to list the

I am a DL noobie so maybe all these questions have already been addressed..

- Cuchulainn

I may have missed a point, especially since you didn't give one. All I see is a link to an unhelpful article (which I waded through without finding even a hint of what's on your mind) and DL/PDE buzzwords in the title.

What's the question centering around DL/PDE, exactly? A guess: does DL solve the curse of dimensionality?

// is there a GOOD paper on DL/PDE. 1st impression is it's a solution looking for a problem. 'Teaching' a pde sounds a bit weird.

> > 1. NN are typically 10k to 10mln dimensional functions. DE would be very slow, you would need at least the double of number agents but in practice many more or else you would be searching in a subspace. Each agent represent an instance of the network, so big memory consumption and computations on it.
>
> I don't get this answer, not at all. Are you saying finding a local minimum is OK?
>
> 1. In every branch, we desire a global minimum >> local minimum, even in DL (have a look at Goodfellow figure 4.3 page 81).

You don't want a global minimum on the training set, because that would be overfitting; hence early stopping.

> 2. Gradient descent methods find local minima, not necessarily global ones. Is that serious?

Yes.

> 3. Overfitting is caused by high-order polynomials (yes?). I don't see what the relationship is with finding minima.
>
> 4. More evidence is needed on "how slow" DE is (the good news is that it always gives a global minimum).
>
> 5. The Cybenko universal approximation theorem seems to have little coupling to anything in mainstream numerical approximation. Maybe it is not necessary, but maybe say that. Borel measures and numerical accuracy are not a good mix IMO.
>
> Mathematically, it feels that this approach is not even wrong..

Your points feel the same way to me too; more precision would be welcome.
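The early-stopping remark in this reply can be sketched as a small rule applied to a sequence of validation losses: keep the weights from the epoch with the best held-out loss and stop once it has not improved for a while. The `patience` parameter and the loss values are hypothetical, chosen only to show the mechanism.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training stops once `patience` epochs pass without a new best validation loss.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation losses: they fall, then rise as the model starts to overfit.
val_losses = [1.0, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5, 0.3]
print(early_stopping_epoch(val_losses))  # 3
```

Training loss would typically keep falling past epoch 3; the rule deliberately stops short of the training-set minimum because the held-out loss has started to rise. Note that the final 0.3 is never reached, which is the price of stopping early.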
