So the idea is to take higher-order gradients of the surrogate loss function (blue) to find the actual optimum (red)?
I think the idea and the math are very nice; it's a good new result. The "MagicBox" symbol is something my 7-year-old son would come up with (actually I'm pretty sure he would draw a cat-face operator), very distracting...
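For what it's worth, the operator itself is tiny to implement. A minimal PyTorch sketch, assuming the standard formulation MagicBox(τ) = exp(τ − stop_gradient(τ)); the `2.0 * theta` log-probability stand-in is just for illustration, not from the paper:

```python
import torch

def magic_box(tau):
    # Evaluates to 1 in the forward pass, but its gradient is
    # magic_box(tau) * grad(tau), so differentiating repeatedly
    # keeps producing the correct score-function terms.
    return torch.exp(tau - tau.detach())

theta = torch.tensor(1.0, requires_grad=True)
log_prob = 2.0 * theta            # stand-in for a log-probability term
val = magic_box(log_prob)         # forward value is exactly 1

# First- and second-order gradients both come out of plain autograd.
(g,) = torch.autograd.grad(val, theta, create_graph=True)
(g2,) = torch.autograd.grad(g, theta)
```

Here `val` is 1, while `g` and `g2` are 2 and 4 respectively (for τ linear in θ, the n-th derivative is (dτ/dθ)^n), which is exactly the "infinitely differentiable" property being sold.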
It looks computationally easy to implement, but the practical relevance will really depend on how well the surrogate loss function approximates the real loss, how well-behaved the higher-order gradient terms are, and how the computational cost scales (Hessians are usually avoided precisely because of the cost/benefit trade-off). The surrogate loss functions I know of are only good local approximations, so you can't take a large step size anyway. These are orthogonal issues, though: bad surrogate losses aren't the topic of the paper, and I'm sure those will improve in the future too, although small batch sizes make the gradients very noisy anyway.
It would be nice if someone benchmarked this on Atari against e.g. PPO and compared computing time, sample efficiency, and the overall performance of the found solutions.