Did you guys discuss this here already?
https://arxiv.org/pdf/1706.04702.pdf
Geoff Hinton (one of the people behind back-propagation) kind of agrees with you: https://www.axios.com/ai-pioneer-advoca ... 37027.html

Is there no alternative to gradient-based methods here? Gradients are badly-behaved objects. They fly off the handle so easily.
People use evolutionary algorithms to search for better network architectures, but practically everyone trains a NN using Stochastic Gradient Descent.

I hope others can feel it is for them as well. I don't have a case I worked out myself, but that should not stop us asking questions such as:
1. Why not use Differential Evolution as well as Gradient Descent? (Saying it's slow is disingenuous.)
2. I would like C++ as well as Python. Is Python slow?
3. Is DL for more than computer vision?
4. A 101 example, A to Z, just to show how it works.
5. I had some exposure to topology in a previous life: somehow TDA feels more robust than universal approximators.

These are genuine questions.

Taking them in turn:

2. It is, that's why TensorFlow is C++ under the hood.
3. Of course! Speech recognition, stochastic control, playing games...
4. Try this: https://www.tensorflow.org/get_started/mnist/beginners (a minimal end-to-end sketch follows below).
5. And... surprise! Neural Networks are not robust: https://blog.openai.com/adversarial-example-research/
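On 4.: here is a minimal "A to Z" sketch in the spirit of the linked MNIST tutorial, written against the tf.keras API (assumed available in the installed TensorFlow). It is my own illustrative version, not the tutorial's code; the layer sizes, optimiser, and epoch count are arbitrary choices, not tuned ones.

```python
import tensorflow as tf

# A: load the MNIST digits (28x28 grayscale images, labels 0-9) and scale to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# B: a small feed-forward network: flatten -> one hidden layer -> 10-way softmax
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# C: train with a gradient-based optimiser on cross-entropy, then evaluate on held-out data
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```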
You don't want a global minimum on the training set, because that would be overfitting: early stopping.

1. NNs are typically 10k- to 10-million-dimensional functions. DE would be very slow: you would need at least twice as many agents as dimensions, and in practice many more, or else you would be searching in a subspace. Each agent represents an instance of the network, so the memory consumption and the computation per agent are large.
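To make that cost concrete, here is a rough toy sketch (my own assumptions, not a recommendation): fitting a 1-20-1 tanh network to sin(x) by Differential Evolution using scipy.optimize.differential_evolution. Note that SciPy's popsize is a multiplier on the number of parameters, so the population itself grows with the dimension, and each member is a full copy of the weight vector.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Toy regression problem: fit y = sin(x) on [-pi, pi] with a 1-20-1 tanh network
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X)

n_hidden = 20
n_params = n_hidden + n_hidden + n_hidden + 1   # W1, b1, W2, b2 -> 61 numbers in total

def loss(theta):
    # Unpack the flat parameter vector into the network's weights and biases
    W1 = theta[:n_hidden].reshape(1, n_hidden)
    b1 = theta[n_hidden:2 * n_hidden]
    W2 = theta[2 * n_hidden:3 * n_hidden].reshape(n_hidden, 1)
    b2 = theta[-1]
    pred = np.tanh(X @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

# SciPy's population has popsize * n_params members, each a full copy of the
# weight vector; every generation costs one loss evaluation per member.
result = differential_evolution(loss, bounds=[(-5.0, 5.0)] * n_params,
                                popsize=15, maxiter=100, seed=0, tol=1e-8)
print("final MSE:", result.fun, "loss evaluations:", result.nfev)
```

Even this 61-parameter toy carries a population of roughly 900 weight vectors and tens of thousands of loss evaluations; at 10^4 to 10^7 weights, that memory and compute are exactly what the answer above is pointing at.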
I don't get the early-stopping answer, not at all. Are you saying finding a local minimum is OK?

Yes.

1. In every branch of optimisation we want a global minimum >> a local minimum, even in DL (have a look at Goodfellow, figure 4.3, page 81).
2. Gradient descent methods find a local minimum, not necessarily a global one. Is that serious?
3. Overfitting is caused by high-order polynomials (yes?). I don't see what the relationship is with finding minima. (See the small experiment after this list.)
4. More evidence is needed on "how slow" DE is (the good news being that it searches for a global minimum).
5. The Cybenko universal approximation theorem seems to have little coupling to anything in mainstream numerical approximation. Maybe it is not necessary, but then say so. Borel measures and numerical accuracy are not a good mix IMO. (A small numerical probe follows at the very end.)
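On 3.: overfitting is not about polynomials as such, it is about any model flexible enough to chase the noise in the training set. A tiny illustration (my own toy: noisy sin data, degrees picked arbitrarily): the degree-15 fit drives the training error toward its minimum but does worse out of sample than degree 3, which is the same reason people stop SGD before reaching the global minimum of the training loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1.0, 1.0, 20))
y_train = np.sin(np.pi * x_train) + 0.2 * rng.standard_normal(20)   # noisy training samples
x_test = np.linspace(-1.0, 1.0, 200)
y_test = np.sin(np.pi * x_test)                                     # noise-free ground truth

for degree in (3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)                   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```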
Mathematically, it feels that this approach is not even wrong.

Your points feel the same way to me too; more precision would be welcome.
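For what it's worth on 5.: Cybenko's theorem is an existence statement. It says a single hidden layer of sigmoids can approximate any continuous function on a compact set, but it says nothing about how many units are needed or what accuracy you get numerically. A crude probe (my own construction for illustration: random hidden weights, output layer fitted by least squares, which is not Cybenko's argument): watch the sup-norm error as the width grows.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 500).reshape(-1, 1)
target = np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * np.pi * x)   # a continuous target on [0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for width in (5, 20, 100, 500):
    W = rng.normal(scale=10.0, size=(1, width))          # random hidden-layer weights
    b = rng.uniform(-10.0, 10.0, size=width)             # random hidden-layer biases
    H = sigmoid(x @ W + b)                                # hidden activations, shape (500, width)
    c, *_ = np.linalg.lstsq(H, target, rcond=None)        # fit the output weights only
    sup_err = np.max(np.abs(H @ c - target))
    print(f"width {width:4d}: sup-norm error on the grid = {sup_err:.4f}")
```

Whether the error comes down fast enough to matter for a given problem is exactly the numerical-accuracy question the theorem does not address.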