Cybenko's theorem is an elegant piece of functional analysis.
Does it work in practice? How?
https://pdfs.semanticscholar.org/05ce/b ... 8c3fab.pdf
It becomes clear why sigmoids are used.
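To make the "does it work in practice" question concrete, here is a minimal sketch (my own toy example, not taken from the linked paper; the target function, the number of hidden units and the fit-by-least-squares trick are all arbitrary choices) of approximating a continuous function by a finite sum of sigmoids, i.e. an expression of the form sum_i alpha_i * sigmoid(w_i*x + b_i), which is exactly the class of functions Cybenko's theorem says is dense in C([a, b]):

```python
import numpy as np

# Approximate a continuous function on [0, 2*pi] by a finite sum of sigmoids,
#   f(x) ~ sum_i alpha_i * sigmoid(w_i * x + b_i),
# the class of functions Cybenko's density theorem is about.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sample grid and target function (sin is an arbitrary example).
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)

# Fix random slopes, place each unit's transition point somewhere in the
# domain, then solve for the output weights alpha by linear least squares.
# (A crude way to exhibit the approximation -- not how networks are trained.)
n_hidden = 50
w = rng.normal(scale=3.0, size=n_hidden)
c = rng.uniform(0.0, 2.0 * np.pi, size=n_hidden)   # transition locations
b = -w * c

H = sigmoid(np.outer(x, w) + b)                    # (200, n_hidden) activations
alpha, *_ = np.linalg.lstsq(H, y, rcond=None)

approx = H @ alpha
print("max abs error on the grid:", np.max(np.abs(approx - y)))
```

Adding more hidden units drives the error down on the compact interval, which is the practical face of the density result; it says nothing about how real deep networks are trained, only that the representation is rich enough.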
I am guessing that's not without a good reason though. My (limited) experience with DE and gradient descent optimization (in the context of model calibration) is that the latter is orders of magnitude faster. It may be that for different kinds of problems the speed comparison is not that unfavorable for DE, I don't know. But I suppose gradient descent has become the de facto standard.
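For what it's worth, here is roughly what the two approaches look like side by side on a toy calibration-style problem (the exponential model, the synthetic data and the step sizes below are my own arbitrary choices; this is a sketch of the two loops, not a benchmark):

```python
import numpy as np
from scipy.optimize import differential_evolution

# Toy "calibration" problem: fit (a, b) in a*exp(-b*t) to synthetic data.
t = np.linspace(0.0, 5.0, 50)
y = 2.0 * np.exp(-0.5 * t)          # data generated with a=2, b=0.5

def loss(p):
    a, b = p
    r = a * np.exp(-b * t) - y
    return np.sum(r * r)

def loss_grad(p):
    a, b = p
    e = np.exp(-b * t)
    r = a * e - y
    return np.array([2.0 * np.sum(r * e),            # d loss / d a
                     -2.0 * a * np.sum(r * t * e)])  # d loss / d b

# Differential evolution: derivative-free, population-based, needs only
# box bounds, but spends many loss evaluations per generation.
de = differential_evolution(loss, bounds=[(0.0, 5.0), (0.0, 2.0)], seed=0)

# Plain gradient descent: one cheap gradient per step, but you must supply
# the derivative and pick a stable step size yourself.
p = np.array([1.0, 1.0])
for _ in range(5000):
    p = p - 1e-3 * loss_grad(p)

print("DE:", de.x, "loss evals:", de.nfev)
print("GD:", p, "loss:", loss(p))
```

The trade-off shows up in the shapes of the two loops: DE only needs the loss and bounds but burns many evaluations per generation, while gradient descent does one cheap update per iteration but requires a derivative and a hand-tuned step size.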
The current view of deep learning sits at a higher level. The network is a computational graph, and the choices you make (topology, activation function) should be seen in the light of "gradient management". E.g. if you use sigmoid-type activations that saturate quickly (after training their output is pinned at the extremes of the range, e.g. 0/1 for the logistic sigmoid or -1/+1 for tanh), then you can look at that as a huge loss of information flow. If your goal is to classify objects in an image and you pass them through a long stack of layers, then each layer needs to make sure the next ones get enough information (in the Shannon sense). Something that always outputs the same extreme value doesn't carry much information. This is the "forward pass" view, where you pass information from layer to layer.

During training you also have the "backward pass", when you do back-propagation, in which you use gradient descent to adjust the model parameters in order to reduce your error/loss. In this backward pass people look at things from a gradient-management point of view. A neuron that multiplies the value of another neuron has a different gradient flow modulation effect than one that *adds* the values. This is all partial derivatives, but there is a nice analogy between the types of neurons and things like gradient routers, gradient distributors and gradient switches (see the sketch below). The task in *deep* neural network training is to make sure the gradient flows deep into the network so that the whole network can participate in learning.

I was talking to T4A. I'm sure he can answer for himself.

Sigh.
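To make the router/distributor/switch analogy concrete, here is a tiny hand-written backward pass through three gates (a toy example of my own, not taken from any framework), plus the sigmoid derivative that causes the saturation problem mentioned above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- forward pass on a toy expression: out = max(a, b) * (c + d) ---
a, b, c, d = 2.0, -1.0, 0.5, 1.5
m = max(a, b)        # max gate
s = c + d            # add gate
out = m * s          # multiply gate

# --- backward pass: push d(out)/d(out) = 1 back through each gate ---
dout = 1.0

# Multiply gate "routes" the gradient: each input receives the gradient
# scaled by the *other* input's value.
dm = dout * s
ds = dout * m

# Add gate "distributes" the gradient unchanged to both inputs.
dc = ds * 1.0
dd = ds * 1.0

# Max gate acts as a "switch": all of the gradient goes to whichever
# input won the forward pass, the other gets zero.
da = dm if a > b else 0.0
db = dm if b > a else 0.0

print("grads:", da, db, dc, dd)

# Saturation: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) is tiny for large
# |z|, so a saturated unit passes almost no gradient back to earlier layers.
for z in (0.0, 5.0, 10.0):
    print("z =", z, " sigmoid'(z) =", sigmoid(z) * (1.0 - sigmoid(z)))
```

The max gate is also why things like max-pooling and ReLU behave like switches during back-propagation: whichever path won the forward pass gets all of the gradient.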
Maybe it's best you learn more first before you complain? There is a lot to learn if you're looking at 90s papers.
Anyhoo, I know what gradient methods and DE are and what their limitations are.
What is interesting is their applications to ANNs. That's something to learn from this thread, how to apply the stuff. It's applied maths indeed.
We all know that, I was talking about your first post about the two-layer neural network. Things are very different now, as are the successes. Gradient descent is much older than the 1990s. Peter Debye did it in 1909!
Fair enough. At this stage I am kind of interested in the numerical methods that are being used, to take the mystique out of this area.