BTW I am trying to get my head around DE for backpropagation by using it instead of *GD.
Correct me if I am wrong, but the jury is still out on the applicability of GD.
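To make concrete what "DE instead of GD" means, here is a minimal sketch of the classic DE/rand/1/bin scheme, under some loud assumptions: `toy_loss` is a made-up quadratic standing in for a network's loss, and the function name, population size, `f` and `cr` values are illustrative choices only, not tuned for any real network.

```python
import random

def toy_loss(w):
    # Hypothetical stand-in for a network loss: minimum at w = (1, 1, ..., 1).
    return sum((wi - 1.0) ** 2 for wi in w)

def differential_evolution(loss, dim, pop_size=20, f=0.5, cr=0.9, gens=200, seed=0):
    rng = random.Random(seed)
    # Random initial population of candidate weight vectors.
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    fit = [loss(p) for p in pop]
    for _ in range(gens):
        for i in range(pop_size):
            # Pick three distinct members other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = [
                pop[a][k] + f * (pop[b][k] - pop[c][k])
                if (rng.random() < cr or k == j_rand) else pop[i][k]
                for k in range(dim)
            ]
            t_fit = loss(trial)
            # Greedy selection: keep the trial only if it is at least as good.
            if t_fit <= fit[i]:
                pop[i], fit[i] = trial, t_fit
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

best_w, best_loss = differential_evolution(toy_loss, dim=3)
print(best_w, best_loss)
```

Note that only loss values are used, never derivatives. Each generation costs `pop_size` loss evaluations, which is where the bill arrives once the loss is a full pass over a dataset.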
Let's start at the beginning: what variables specify the loss function, and what is the goal of minimizing a loss function?
One of the alternatives to backpropagation used in Reinforcement Learning is called "Evolutionary Strategies", which doesn't use backpropagation or gradients.
"I Rechenberg & M Eigen. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog Stuttgart, 1973"
Gradients are just one way .. it's not a Commandment or a law of gravity.
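As an illustration of a gradient-free update, here is a minimal sketch of one simple ES variant (Gaussian perturbations of the weights, then a loss-weighted average of the perturbations). It is not taken from the Rechenberg book; `toy_loss`, `es_minimize` and all hyperparameter values are assumptions made up for this example.

```python
import random

def toy_loss(w):
    # Hypothetical stand-in for a network loss: a quadratic bowl with minimum at (3, -2).
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def es_minimize(loss, w, sigma=0.1, lr=0.05, pop=50, steps=300, seed=0):
    rng = random.Random(seed)
    dim = len(w)
    w = list(w)
    for _ in range(steps):
        # Sample Gaussian perturbations and evaluate the loss at each perturbed point.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(pop)]
        losses = [loss([w[i] + sigma * eps[i] for i in range(dim)]) for eps in noise]
        mean = sum(losses) / pop  # baseline, reduces the variance of the estimate
        for i in range(dim):
            # Loss-weighted average of the perturbations: a search direction
            # built from loss values only, no derivatives of `loss` required.
            g = sum((l - mean) * eps[i] for l, eps in zip(losses, noise)) / (pop * sigma)
            w[i] -= lr * g
    return w

w_final = es_minimize(toy_loss, [0.0, 0.0])
print(w_final, toy_loss(w_final))
```

The loss is treated as a black box here, which is why this kind of method also works when the objective is not differentiable (as is common in RL).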
BTW I remember you saying a while back that you did not speak German.
Anyways, hybrid DE-BP (GA-BP) could be promising.
Q: is it worthwhile buying Rechenberg if I have John Holland's and David Goldberg's books?
Indeed, I don't read German. The paper is a "classic" that pops up in more recent papers, normally you appreciate that?
The reason I keep asking about "the loss function" is that you should shift focus to the "full problem" (what is it that ML tries to accomplish?) so that you can better see what the real bottlenecks and problems are, the dimensions of things, where the computational cost is, and what the objective is (out-of-sample prediction, generalising, not memorising). If you experiment with a real problem (MNIST is a classic), you will experience that.
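A small sketch of the "generalising, not memorising" point, on made-up data (y = 2x + noise; nothing here is MNIST, and the names `memoriser` and `line` are hypothetical): a model that memorises the training set has zero training error, yet does worse out of sample than a plain fitted line.

```python
import random

rng = random.Random(0)

def make_rows(n):
    # Hypothetical toy data: y = 2x + Gaussian noise with unit variance.
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        rows.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return rows

train, test = make_rows(500), make_rows(500)

def mse(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

def memoriser(x):
    # Return the target of the closest training point (1-nearest neighbour).
    return min(train, key=lambda row: abs(row[0] - x))[1]

# Least-squares line fitted on the training set only.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx

def line(x):
    return slope * x + intercept

for name, model in [("memoriser", memoriser), ("line", line)]:
    print(name, "train MSE:", mse(model, train), "test MSE:", mse(model, test))
```

Whatever optimiser is used (GD, DE, ES), it only ever sees the training loss; the number that matters is the one on the held-out set.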
A hypothetical case: the reason people don't use Newton-Raphson in GD is *not* that no-one thus far has had the bright idea to blow the dust off their Philosophiæ Naturalis Principia Mathematica, but that the parameter space is typically huge for relevant problems, NR is O(D^2) in the number of parameters D (the Hessian alone has D^2 entries), and the computational burden of that is just too big. You might not experience that with toy problems, but you *will* when trying out your idea on something like MNIST. The same goes for realising that there is a train and a test set in typical setups. Why do they do that, and what does it say about what you are trying to accomplish? Or experience the size of the datasets and the implications that has when you want to make that work with GPUs.
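A back-of-the-envelope sketch of that O(D^2) cost, assuming single-precision floats and a hypothetical small 784-100-10 fully connected network for MNIST (the architecture and the helper name `newton_storage_bytes` are illustrative only):

```python
def newton_storage_bytes(d, bytes_per_float=4):
    # The gradient needs d floats; a dense Hessian needs d * d floats.
    return d * bytes_per_float, d * d * bytes_per_float

# Hypothetical parameter counts: a toy problem vs. a 784-100-10 network
# (weights plus biases: 784*100 + 100 + 100*10 + 10).
cases = [("toy", 10), ("mnist_mlp", 784 * 100 + 100 + 100 * 10 + 10)]

for name, d in cases:
    grad_b, hess_b = newton_storage_bytes(d)
    print(f"{name}: D={d}, gradient {grad_b / 1e6:.3f} MB, Hessian {hess_b / 1e9:.3f} GB")
```

For the toy case the Hessian is 400 bytes; for the small network it is roughly 25 GB before a single linear system has been solved, and that network is tiny by current standards.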
Instead of buying old books I would just google for "derivative free methods in neural networks" and make an inventory of what has been done so far. Or, for general optimization, just use this as a starting point:
http://thales.cheme.cmu.edu/dfo/comparison/dfo.pdf
For a book I would recommend the upcoming *2nd edition* (2018) of "Reinforcement Learning: An Introduction" (there are final drafts you can download for free). It's a bit like PWOQFII for RL (I.O.W. I like it *a lot*); it is, however, much broader than just optimising a loss function: it's more about learning optimal behaviour in an unknown environment. There are a lot of opportunities in this field to find advances; just the last couple of years have seen an order of magnitude of improvement, and it is not losing any momentum.
A very easy to read recent paper on using ES in NN/RL is one by (amazingly) Uber:
http://eng.uber.com/wp-content/uploads/ ... -arxiv.pdf