
So the idea is to use higher-order gradients of the surrogate loss function (blue) to find the actual optimum (red)?

I think the idea and the math are very nice; it's a good new result. The "MagicBox" symbol is something my 7-year-old son would come up with (actually, I'm pretty sure he would draw a cat-face operator [x]), which is very distracting...

It looks computationally easy to implement, but the practical relevance will really depend on how well the surrogate loss function approximates the real loss function, how well behaved the higher-order gradient terms are, and how the computational costs scale (Hessians are usually avoided because of the practical cost/benefit). The surrogate loss functions I know of are only good local approximations, so you can't take a large step anyway? These are, however, orthogonal issues: bad surrogate loss functions are not the topic of the paper, and I'm sure those will improve in the future too, although small batch sizes make gradients very noisy anyway.

It would be nice if someone ran a benchmark on Atari against e.g. PPO and compared computing time, sample efficiency and the overall performance of the solutions found.
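To make the "only a good local approximation" point concrete, here is a hypothetical 1-D sketch (my own toy example, not from the paper): a quartic true loss, its quadratic Taylor surrogate at a point, and the Newton step that exactly minimises the surrogate. The step uses the surrogate's higher-order (second) derivative, yet still only moves part of the way to the true optimum.

```python
# Toy 1-D illustration (not from the paper): a quadratic surrogate of a
# quartic loss is only locally valid, so a step based on its higher-order
# derivatives does not jump straight to the true optimum.

def loss(x):          # true loss: quartic, minimum at x = 0
    return x**4

def grad(x):          # analytic first derivative of the true loss
    return 4 * x**3

def hess(x):          # analytic second derivative of the true loss
    return 12 * x**2

x0 = 1.0
# Second-order Taylor surrogate around x0:
#   S(x) = loss(x0) + grad(x0)*(x - x0) + 0.5*hess(x0)*(x - x0)**2
# The Newton step is the exact minimiser of that surrogate:
x1 = x0 - grad(x0) / hess(x0)   # 1 - 4/12 = 2/3

print(x1)  # moves toward the optimum at 0, but only by one third of the way
```

The surrogate is exact at x0 and increasingly wrong away from it, which is why the step is conservative; with noisy small-batch gradients the surrogate's derivatives would be noisy too.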


- Cuchulainn
**Posts:** 57335 **Joined:** **Location:** Amsterdam

As described in various papers and conference presentations, the AAD approach can potentially reduce the computational cost of sensitivities by several orders of magnitude, while introducing no approximation error. It can be used either for computing the Greeks or for computing the exact (up to machine precision) gradient (or even the Hessian matrix), the latter being very useful with a gradient-based local optimizer. A framework based on AD can also be developed for the automatic computation of Greeks/sensitivities from existing code, similar to work done over the last 15-20 years in areas such as fluid dynamics, meteorology and data assimilation. An overview of the approach and of the relevant literature is presented in http://papers.ssrn.com/sol3/papers.cfm? ... =1828503

If you know of any other relevant references, please mention them in this thread. If you have used any AD software recently, please share your experience with it, if possible. Thank you.
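To illustrate the "exact up to machine precision" claim, here is a minimal forward-mode AD sketch using dual numbers; the `Dual` class and `bs_call` below are my own toy code, not from any library, and production AAD would typically use reverse (adjoint) mode instead. Seeding dS = 1 yields the Black-Scholes delta in a single evaluation of the pricing function.

```python
# Minimal forward-mode AD via dual numbers: a + b*eps with eps^2 = 0,
# where b carries the derivative exactly (no bump, no truncation error).
import math

class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b           # value, derivative
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a - o.a, self.b - o.b)
    def __mul__(self, o):               # product rule
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__
    def __truediv__(self, o):           # quotient rule
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a / o.a, (self.b * o.a - self.a * o.b) / (o.a * o.a))

def dlog(x):                            # ln and its derivative 1/x
    return Dual(math.log(x.a), x.b / x.a)

def dnorm_cdf(x):                       # Phi and its derivative phi
    pdf = math.exp(-0.5 * x.a * x.a) / math.sqrt(2 * math.pi)
    return Dual(0.5 * (1 + math.erf(x.a / math.sqrt(2))), x.b * pdf)

def bs_call(S, K, r, sigma, T):         # Black-Scholes call on Dual spot S
    sqrtT = math.sqrt(T)
    d1 = (dlog(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrtT)
    d2 = d1 - sigma * sqrtT
    return S * dnorm_cdf(d1) - K * math.exp(-r * T) * dnorm_cdf(d2)

S = Dual(100.0, 1.0)                    # seed dS = 1 to get delta = dV/dS
price = bs_call(S, 100.0, 0.05, 0.2, 1.0)
print(price.a, price.b)                 # call value and its delta
```

The derivative falls out of the same pass that computes the price, and it agrees with the analytic delta N(d1) to machine precision; no bump size has to be chosen.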

- Cuchulainn
**Posts:** 57335 **Joined:** **Location:** Amsterdam

AD is quite cute and very elegant, but is it not a CS solution (using graphs) to what is essentially a problem in numerical analysis, i.e. computing a derivative at a given point?

AD is easy to understand (do by hand) and then you see the data structures. For large problems these will become yuge IMO.

For PDE I think AD will not be optimal. While the FDM truncation error is small, its derivative will not be small in general.

