Cybenko's theorem is an elegant piece of functional analysis.

Does it work in practice? How?

https://pdfs.semanticscholar.org/05ce/b ... 8c3fab.pdf

// It becomes clear why sigmoids are used.

- Cuchulainn
**Posts:** 62617 **Location:** Amsterdam

Step over the gap, not into it. Watch the space between platform and train.

http://www.datasimfinancial.com

http://www.datasim.nl

In the 80s and 90s people used sigmoids as activation functions for NNs, but that didn't work too well for deep neural networks.

One of the reasons is called "the vanishing gradient problem". When training a NN you typically run gradient descent on a cost function over the output error of the network. The sigmoid has a derivative that quickly goes to zero as you move away from zero in either direction. This means the gradient vanishes for a lot of neurons, and that hinders learning. It especially happens in deep networks (lots of layers of neurons, one layer connected to the next). One of the key drivers of the deep learning explosion was replacing the sigmoid with the ReLU function (which is identical to the payoff of a call option!). The ReLU has a simple, efficiently computable gradient, and it also promotes sparse encodings (situations where some neurons give a signal and others are zero).
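A quick numerical sketch of the vanishing-gradient point (plain Python, no NN library): the sigmoid's derivative σ'(x) = σ(x)(1 − σ(x)) peaks at 0.25 at the origin and decays fast, while the ReLU's derivative stays at exactly 1 for any positive input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: sigma(x) * (1 - sigma(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # derivative of ReLU(x) = max(0, x): 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.0f}")
```

At x = 10 the sigmoid gradient is already of order 1e-5, while the ReLU gradient is still 1.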

The sigmoid has also been ditched as an activation function for neurons that need to generate outputs in the [-1,1] range; people use tanh for that nowadays. The reason for the switch is that the sigmoid is non-symmetric (its outputs lie in [0,1], so they are skewed positive), whereas tanh is symmetric around zero. That symmetry reduces bias in the gradients during training.
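The symmetry point can be checked directly: tanh is odd (tanh(−x) = −tanh(x)) and zero-centred, while the sigmoid maps everything into (0,1), so its outputs have a positive mean on symmetric inputs. A minimal check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

xs = [-3.0, -1.0, 0.0, 1.0, 3.0]

# tanh is odd: outputs on a symmetric grid average to zero
tanh_mean = sum(math.tanh(x) for x in xs) / len(xs)

# sigmoid is not: every output lies in (0, 1), mean 0.5 on the same grid
sig_mean = sum(sigmoid(x) for x in xs) / len(xs)

print(tanh_mean, sig_mean)
```

The two are affinely related, σ(x) = (1 + tanh(x/2))/2, so this really is a question of centring, not of shape.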

There is a nice discussion here starting at 14:00

.. the key point IMO is that you need a neural network of at least 2 layers for universal approximation; this was proved in 1991 by Hornik. The sigmoid is not what makes it work, it's the >1 layers.

In the 60s DARPA funded a lot of research into the perceptron -which is a *single* layer NN- as a generic learning machine. However, in 1969 Marvin Minsky proved in a classic paper that the perceptron can't learn the XOR function (and hence there is no universal learning) https://www.quora.com/Why-cant-the-XOR- ... perceptron
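Minsky's XOR point can be checked by brute force: no single linear threshold unit w1·x1 + w2·x2 + b reproduces XOR, because the four constraints are inconsistent. A crude grid search over weights (a sketch, not the original proof) finds no solution:

```python
import itertools

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def fits(w1, w2, b):
    # a single-layer perceptron: output 1 iff w1*x1 + w2*x2 + b > 0
    return all((w1 * x1 + w2 * x2 + b > 0) == bool(y)
               for (x1, x2), y in XOR.items())

# exhaustive search over a weight grid: no linear unit reproduces XOR
grid = [i / 2 for i in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
found = any(fits(w1, w2, b) for w1, w2, b in itertools.product(grid, repeat=3))
print(found)
```

No grid is needed to see why: XOR demands w2 + b > 0, w1 + b > 0, b ≤ 0 and w1 + w2 + b ≤ 0; adding the first two gives w1 + w2 + 2b > 0, which with b ≤ 0 contradicts the last. One hidden layer removes the contradiction.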

- Cuchulainn

Looking at this from a mathematical/numerical analysis point of view: what they are doing here is an application/example of gradient descent IMO. Once you agree on that, some issues become clear and will surface in applications.

//

*One of the reason is called "the vanishing gradient problem"*

This is an obvious problem when using gradient descent methods. The problem is not peculiar to ANNs; it shows up in very many areas of research.

What about "*the exploding gradient problem*"?

In general, there are minimisation methods that do not rely on gradients at all, e.g. DE (differential evolution). But I suppose gradient descent has become the de facto standard..
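DE is available off the shelf; SciPy ships an implementation, for instance. A minimal sketch on the Rosenbrock test function (assuming SciPy is installed):

```python
from scipy.optimize import differential_evolution, rosen

# gradient-free global minimisation of the Rosenbrock function;
# DE only ever evaluates the objective, never its derivatives
bounds = [(-2.0, 2.0), (-2.0, 2.0)]
result = differential_evolution(rosen, bounds, seed=42, tol=1e-10)

print(result.x)    # close to the known minimum at (1, 1)
print(result.fun)  # close to 0
```

The price of going gradient-free is many more function evaluations per unit of progress, which matters once the "function" is a full forward pass of a large network.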

*The sigmoid function has a derivative that quickly goes to zero as you move away from zero in either direction. This means that the gradient goes to zero for a lot of neurons, and that hinders learning. *

OK. But is that the sigmoid's fault, or is it that gradient descent does not work with the sigmoid? Like putting diesel into an electric car.

The tanh looks like a tweak.

My take: crudely put, gradient descent has its issues that don't go away on their own.

*But I suppose gradient descent has become the de facto standard..*

I am guessing that's not without good reason though. My (limited) experience with DE and gradient descent optimization (in the context of model calibration) is that the latter is orders of magnitude faster. It may be that for different kinds of problems the speed comparison is not that unfavorable for DE; I don't know.

Yes, gradient descent with back-propagation is the most widely used method when training a neural network with supervised learning.

later more discussion material!..

- Traden4Alpha
**Posts:** 23951

First, we seem to have a language issue here in that the paper talks about sigmoidal functions in a very general way rather than any specific S-shaped function. It's about "a sigmoid function" as opposed to "the sigmoid function". Tanh() fits the paper's definition of a sigmoid function.

Yes, it's clear that AND(the sigmoid, gradient descent) has poor performance. It's less clear whether that implies anything about the relative performance of AND(other cost functions, gradient descent) versus AND(the sigmoid, other minimizers).

I'd think that the sigmoid (and maybe many sigmoids) would be susceptible to BOTH the vanishing-gradient and exploding-gradient problems, and that the one causes the other. The vanishing-gradient issue causes instabilities in estimating the scale parameter, and ill-conditioned scale values can induce exploding gradients. But maybe there are some clever minimization algorithms that can find minima in compositions of sigmoids, although to the extent the system relies on empirical values to construct the cost function or define the learning data set, numerical instabilities could be a real problem.
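The depth effect is easy to quantify: in a chain of layers the backpropagated gradient picks up one factor of the activation's derivative per layer, and since the sigmoid's derivative never exceeds 0.25, a depth-n chain attenuates the gradient by at least 4^-n. A toy sketch (a crude model, ignoring the weight matrices):

```python
def sigmoid_grad_at(s):
    # derivative of the sigmoid expressed via its output s: s * (1 - s)
    return s * (1.0 - s)

# best case: every neuron sits at the sigmoid's steepest point, output s = 0.5
best_factor = sigmoid_grad_at(0.5)  # 0.25

for depth in [1, 5, 10, 20]:
    print(depth, best_factor ** depth)
```

Even in this best case, 20 sigmoid layers shrink the gradient by a factor of roughly 1e-12; saturated neurons make it far worse. Large weights push the per-layer factor the other way, which is the exploding side of the same coin.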

It's also worth noting that the paper shows that sigmoidal functions can be used in principle to represent any function in an analytic sense but says nothing about the numerical stability of such representations or the computational complexity of such representations.

Ultimately, the proper course of action is in understanding real-world cost functions and showing whether/why they tend to have sigmoidal or RELU-like properties. Mathematical elegance means nothing if the solution is simply wrong.

- Cuchulainn

Richard Feynman says

1. Make a guess

2. Work out a solution

3. If the output from 2 does not agree with reality then it is wrong (2, not reality) and GOTO 1.

In maths theorems we try to give necessary and sufficient conditions for some class of problems to have a solution. Then you take your specific test case and see if it satisfies the axioms:

1. yes, then try it

2. no, use something else or modify the problem input in some way.

More an issue of robustness in this case as well.

Gradient methods will give performance issues when the initial guess is far away from the exact solution.
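A one-line illustration of that sensitivity, on a function with two minima (a sketch using plain gradient descent with a fixed step):

```python
def grad(x):
    # derivative of f(x) = (x^2 - 1)^2, which has minima at x = -1 and x = +1
    return 4.0 * x * (x * x - 1.0)

def descend(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(0.5))   # converges to +1
print(descend(-0.5))  # converges to -1
```

Two starting points on either side of the hump at x = 0 end up in different minima, and nothing in the method tells you which basin you landed in.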

Last edited by Cuchulainn on July 8th, 2017, 4:16 pm, edited 1 time in total.

Sigh.

Maybe it's best you learn more first before you complain? There is a lot to learn if you're looking at 90s papers.

Homework for Cuch: this recent paper (June 2017) is getting many people excited; it proposes SELU (instead of ReLU or the sigmoid). It works really well, I'm seeing very stable learning with deep networks.

You can go straight to the appendix with the proofs (page 9 ..100) that motivate why it should work so well.

https://arxiv.org/abs/1706.02515
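For reference, the SELU from that paper is an exponential linear unit scaled by two fixed constants (λ ≈ 1.0507, α ≈ 1.6733), chosen so that activations self-normalise towards zero mean and unit variance:

```python
import math

# constants from Klambauer et al. (2017), "Self-Normalizing Neural Networks"
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    # scaled identity for positive inputs, scaled exponential for negative ones
    if x > 0:
        return LAMBDA * x
    return LAMBDA * ALPHA * (math.exp(x) - 1.0)

print(selu(1.0), selu(-1.0), selu(0.0))
```

Unlike the ReLU, the negative branch is nonzero, so SELU units don't "die", and the specific constants are what the appendix proofs are about.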

- Traden4Alpha

Proving that any function can be represented by sigmoids is necessary but not sufficient for showing that sigmoids are a good solution. Cybenko shows sigmoids *can* be used, not that they *should* be used.

It would be interesting to collect all the Cybenko-like theorems that show how polynomials, sinusoids, wavelets, various splines, etc. have the potential to represent other functions either exactly or with well-bounded error. An interesting analog of this are the various proofs that certain classes of discrete mathematical systems are capable of universal computation.
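For the polynomial entry in that collection the classical result is the Weierstrass approximation theorem, and the well-bounded error is easy to see numerically. A sketch with NumPy's Chebyshev module (an arbitrary smooth target function, chosen here for illustration):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# approximate a smooth target on [-1, 1] with a degree-15 Chebyshev series
x = np.linspace(-1.0, 1.0, 400)
y = np.exp(x) * np.sin(3.0 * x)

fit = C.Chebyshev.fit(x, y, deg=15)
max_err = np.max(np.abs(fit(x) - y))
print(max_err)
```

For smooth functions the Chebyshev error decays geometrically in the degree, which is exactly the kind of quantitative bound the Cybenko theorem does not give for sigmoid networks.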

And I totally agree about gradient methods. Worse, if the cost function has many minima, gradient methods can readily give high-confidence, low-correctness solutions.

- Cuchulainn

Gradient descent is much older than the 1990s. Peter Debye did it in 1909!

I was talking to T4A. I'm sure he can answer for himself.

Anyhoo, I know what gradient methods and DE are and what their limitations are.

What is interesting is their applications to ANNs. That's something to learn from the experts on this thread how to apply the stuff. It's applied maths indeed.

It is also possible to download a library and kick the tires. No bother either way.

The current view of deep learning is more on a higher level. The network is a computational graph, and the choices you make -topological, activation function- should be seen in the light of "gradient management". E.g. if you use sigmoids that saturate quickly (their output after training converges to always 1 or -1) then you can look at that as a huge information-flow loss. If your goal is to classify objects in an image and you pass them through a long list of layers, then each layer needs to make sure that the next one gets enough information (Shannon information). Something that's always +1/-1 doesn't carry much information. This is the "forward pass" view, where you pass information from layer to layer.

During training you also have the "backward pass", when you do back-propagation, in which you use gradient descent to adjust the model parameters in order to reduce your error/loss. In this backward pass people look at things from a gradient-management point of view. A neuron that multiplies the value of another neuron has a different gradient-flow modulation effect than one that *adds* the values. This is all partial derivatives, but there is a nice analogy between the types of neurons and things like gradient routers, gradient distributors, and gradient switches. The task in *deep* neural network training is to make sure the gradient flows deep into the network so that the whole network can participate in learning.
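The router/distributor analogy can be made concrete with two scalar gates and the chain rule (a hand-rolled sketch, no autodiff library): an add gate passes the incoming gradient to both inputs unchanged, while a multiply gate routes each input's gradient scaled by the *other* input's value.

```python
# forward pass: z = (x + y) * w
x, y, w = 2.0, 3.0, 4.0
s = x + y        # add gate
z = s * w        # multiply gate

# backward pass, starting from dz/dz = 1
dz = 1.0
ds = dz * w      # multiply gate: s's gradient is scaled by w's value
dw = dz * s      # ... and w's gradient by s's value
dx = ds * 1.0    # add gate: distributes the incoming gradient unchanged
dy = ds * 1.0

print(dx, dy, dw)  # 4.0 4.0 5.0
```

So a large multiplicative partner amplifies the gradient flowing through a node, and a small one chokes it, which is the whole "gradient management" game in miniature.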

*Gradient descent are much older than 1990s. Peter Debye did it 1909!*

We all know that; I was talking about your first post, about the two-layer neural network. Things are very different now, as are the successes.

- Cuchulainn

Fair enough. At this stage I am kind of interested in the numerical methods that are being used, to take the mystique out of this area.

Found this. Looks standard

https://www.neuraldesigner.com/blog/5_a ... al_network

Why do all these methods need gradients? Seems like a pain in the ass. Maybe there's no other way? Can you not "train NNs without gradients"? Don't see why not..
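To that question: in principle yes, any black-box optimiser over the weight vector will do; it just scales poorly with the number of weights. A toy sketch training a 2-2-1 tanh network on XOR by pure random search, no gradients anywhere (illustrative only, not a recommendation):

```python
import math
import random

DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(w, x1, x2):
    # 2-2-1 network: two tanh hidden units, one sigmoid output
    h1 = math.tanh(w[0] * x1 + w[1] * x2 + w[2])
    h2 = math.tanh(w[3] * x1 + w[4] * x2 + w[5])
    out = w[6] * h1 + w[7] * h2 + w[8]
    return 1.0 / (1.0 + math.exp(-out))

def loss(w):
    # squared error over the four XOR points
    return sum((forward(w, x1, x2) - y) ** 2 for (x1, x2), y in DATA)

random.seed(0)
best_w, best_loss = None, float("inf")
for _ in range(20000):
    w = [random.uniform(-5, 5) for _ in range(9)]
    l = loss(w)
    if l < best_loss:
        best_w, best_loss = w, l

# with luck, the best sample classifies all four XOR points correctly
correct = all((forward(best_w, x1, x2) > 0.5) == bool(y) for (x1, x2), y in DATA)
print(best_loss, correct)
```

This works because the problem has 9 parameters; with millions of weights the search space makes gradient-free methods hopeless, which is why back-propagation won.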
