- Cuchulainn
**Posts:** 62371 · **Location:** Amsterdam

> Your CUDA commodity argument still doesn't address the AMP issue you ran into: you likely wrote parallel code using AMP that performed worse than simple sequential code.

Actually, I have not written any AMP code yet, and maybe I won't. I am seeing what the views are, and we know your views already. I am interested in how long it takes the average quant to learn. At the end of the day it is a niche market. And yes, I have ordered that book.

> You should be able to teach yourself CUDA in one week and then move on to the next thing.

What's the next hardware platform you will be working on next week?

Last edited by Cuchulainn on April 19th, 2015, 10:00 pm, edited 1 time in total.

In my limited experience (~1 month) of playing around with CUDA, I got my basic CUDA implementation of pricing to run ~4x faster after reading about and applying CUDA-specific optimisations, such as exploiting the memory hierarchy and the low-level details of the cores and scheduler. 4x is a big difference, and I suspect this is because in the CPU world you get all this for free from a decent compiler. GPU technology is still young (but improving with each compute-capability version). I don't know how AMP works internally, but I doubt it would scale well against a good CUDA implementation. I find CUDA quite interesting: I wasn't around when CPU technology went through this development, so this is my chance to follow the developments in GPGPU technology.

Cuch, to answer your original question about migrating C++ (pattern/templated) code to CUDA: CUDA kernels (user-defined procedures that run on CUDA cores) run a limited version of C, i.e. C without library calls. So your C++ implementation of MC will require some re-engineering. Most existing quant libraries have their compute code in old-school C and the rest in C++, so it wouldn't be a huge re-engineering challenge to get a basic implementation working with parts of the computation running on the GPU. Of course, part of the engineering will be to write parallel-processing-friendly algorithms rather than the pure maths versions written by a maths dude back in the 90s.

- Cuchulainn

> Originally posted by ashkar: In my limited experience (~1 month) of playing around with CUDA, I got my basic CUDA implementation of pricing to run ~4x faster [...] rather than the pure math versions written by a maths dude back in the 90s.

Nice.

> Typically only part of a computation can be parallelized. Suppose 50% of the computation is inherently sequential, and the other 50% can be parallelized.
>
> Question: How much faster could the computation potentially run on many processors?
>
> Answer: At most a factor of 2, no matter how many processors. The sequential part is taking half the time, and that time is still required even if the parallel part is reduced to zero time.
>
> High Performance Scientific Computing, University of Washington. Prof: Dr. Randall J. LeVeque

---> so ~4x faster is the limit

- Cuchulainn

> Originally posted by ExSan: Typically only part of a computation can be parallelized. [...] At most a factor of 2, no matter how many processors. [...] ---> so ~4x faster is the limit

Indeed, Amdahl's law. BTW, does Amdahl's law hold for GPU boxes? Efficiency == speedup / #processors. Outrun gets a speedup of 400.

Follow-on: look (just look) at a numerical algorithm and tell what its serial fraction is, just by looking at it, without writing one line of code. E.g. ADI ~ 90%?

Last edited by Cuchulainn on April 20th, 2015, 10:00 pm, edited 1 time in total.

- Cuchulainn

Everyone talks about MC, which is simple maths and embarrassingly parallel. What about other examples on CUDA? ADI?

> Originally posted by Cuchulainn: Everyone talks about MC, which is simple maths and embarrassingly parallel. What about other examples on CUDA? ADI?

Hi Cuch,

I can try and CUDA-fy an ADI scheme example if you can send me one. In general, what I saw was that a simple implicit method, which requires solving a set of linear equations, doesn't scale well on CUDA (or any parallel processing environment) for the problem sizes we face in option pricing. This is due to the elimination part, which cannot be parallelised. For example, on an old 16-core video card, for a constant-vol BS implicit PDE with regular mesh sizes, I see the following for various time steps and spot steps:

| Spot (x) | Time (y) | CPU | GPU | GPU/CPU |
|---------:|---------:|-----:|-------:|--------:|
| 100 | 100 | 0.00 | 0.1998 | 0.01 |
| 500 | 100 | 0.01 | 0.047 | 0.11 |
| 1000 | 500 | 0.02 | 0.93 | 0.03 |
| 2000 | 500 | 0.06 | 0.98 | 0.06 |
| 5000 | 500 | 0.34 | 1.15 | 0.29 |
| 10000 | 500 | 0.87 | 1.34 | 0.65 |
| 20000 | 500 | 1.99 | 1.625 | 1.22 |
| 30000 | 500 | 3.07 | 1.89 | 1.62 |
| 40000 | 500 | 4.10 | 2.23 | 1.84 |
| 80000 | 1000 | 21.79 | 8.2 | 2.66 |

With newer cards a lot has improved, e.g. there is no need to go back to the CPU at each time step, so the efficiency will likely be better on a newer card. I think most people in finance end up only running Monte Carlo on CUDA.

- Cuchulainn

Hi ashkar,

Thank you for your kind offer. Let's do it!

> In general what I saw was that a simple implicit method, which requires solving a set of linear equations, doesn't scale well on CUDA (or any parallel processing environment) for the problem sizes we face in option pricing. This is due to the elimination part, which cannot be parallelised.

Yes. ADI won't be much better, I expect: it uses tridiagonal matrices at each ADI/Soviet splitting leg. I heard GPU for ADI was even slower than sequential, which does not come as a huge surprise, to be honest.

The good news is that I now use ADE, which needs NO matrix LU decomposition, just basic matrix manipulation. I suggest doing the port of the paper by Alan, Paul and myself on an Asian-style PDE here. The code I have is in C++ using a Boost uBLAS matrix, but this could be replaced by your favourite CUDA matrix. So there are code results in both C++ and Mathematica to compare with; it is a baseline reference case. Hotspots: 1) the logistic vol function takes 75% of the computation on the CPU; 2) computing the volatility surface, which we could do on the GPU?

Last edited by Cuchulainn on April 22nd, 2015, 10:00 pm, edited 1 time in total.

- Cuchulainn

> Spot(x) Time(y) CPU GPU GPU/CPU [...] 80000 1000 21.79 8.2 2.66

I suppose Spot = #S divisions and Time(y) = #time divisions? In general NS = NT ~ 400 gives 2-3 digits of accuracy, so the GPU loses out, yes? The serial fraction of ADI is high, so Amdahl's law will be very pessimistic.

> I think most people in finance end up only running Monte Carlo on CUDA.

It's relatively simple SPMD? An interesting option is to do Method of Lines (MOL) on the Anchoring PDE and then Boost C++ odeint on the GPU:

> [PDF] Solving ODEs with CUDA and OpenCL, Using Boost.Odeint. Karsten Ahnert, Mario Mulansky, Denis Demidov, Karl Rupp, and Peter Gottschling.

Last edited by Cuchulainn on April 22nd, 2015, 10:00 pm, edited 1 time in total.

> Originally posted by Cuchulainn: Hi ashkar, thank you for your kind offer. Let's do it! [...] Hotspots: 1) the logistic vol function takes 75% of the computation on the CPU; 2) computing the volatility surface, which we could do on the GPU?

I don't mind testing it for you. I've been thinking of revisiting the tridiag-solver issue myself anyway, since I bought a new CUDA card; apparently the back-substitution stage should be more efficient on this card, partly due to a new library version too. I have a parametric vol surface implementation which I fit to market data in advance. Then during computation (for example in local vol) I can compute the parametric vols on the GPU. The simpler the vol function the better, since only around 20% of the CUDA cores can compute special functions like exp, log and sin while the other cores queue up, so efficiency goes down.

Is the Boost matrix stored as a C array in memory? Either way I can adapt the code. How do you want to send it?

I had a brief look at the paper (the SDE part). Still need to read the ADE section.

> Originally posted by Cuchulainn: I suppose Spot = #S divisions and Time(y) = #time divisions? [...] An interesting option is to do Method of Lines (MOL) on the Anchoring PDE and then Boost C++ odeint on the GPU?

Yes, Spot = # spot divisions, and similarly for time; the last column is the ratio of the CPU and GPU computation times. For example, you can read off from the results that the crappy GPU I had (16 cores, 512 MB memory) only starts to show any benefit around 20k spot steps and 500 time steps: the tridiagonal solver shows a 1.22x speedup on a 20000x20000 tridiagonal (sparse) matrix. That's why I didn't bother testing ADI on CUDA. I'll try Boost odeint in the next few days, as per your suggestion.
