Serving the Quantitative Finance Community

Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 24th, 2015, 8:29 am

Quote, originally posted by outrun:
    here is a CUDA Tridiagonal solver using "Cyclic Reduction". Would be cool to see if that actually works.

It probably 'works' but I suspect it will have very little impact (it is micro-optimization). ADI is inherently sequential, as ashkar's data show.

BTW ADE (!= ADI) does not use/need LU (it is explicit).
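For readers who want to see where the sequential dependency in the LU-based tridiagonal solve comes from, here is a minimal sketch of the classic Thomas algorithm. It is not the cyclic reduction solver mentioned above and not anyone's production code; the name `thomas_solve` and the argument layout are illustrative assumptions, and the matrix is assumed diagonally dominant so that no pivoting is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Thomas algorithm (LU without pivoting) for a tridiagonal system.
// a: sub-diagonal (a[0] unused), b: main diagonal,
// c: super-diagonal (c[n-1] unused), d: right-hand side.
// Assumption: the matrix is diagonally dominant.
std::vector<double> thomas_solve(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 const std::vector<double>& c,
                                 const std::vector<double>& d)
{
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);

    std::vector<double> cp(n), dp(n), x(n);

    // Forward sweep: row i needs the already-eliminated row i-1,
    // so the iterations cannot run independently.
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }

    // Back substitution: x[i] needs x[i+1], again a chain.
    x[n - 1] = dp[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = dp[i] - cp[i] * x[i + 1];
    }
    return x;
}
```

Both loops carry a dependency from one row to the next. Cyclic reduction reorders the elimination to expose parallel work, at the price of more arithmetic in total.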
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 24th, 2015, 11:26 am

Quote, originally posted by Cuchulainn:
    ADI is inherently sequential, as ashkar's data show.

Quote, originally posted by outrun:
    That's wrong reasoning about causality. The data doesn't *show* that it is inherently sequential; he (maybe?) used a sequential tridiagonal solver, for which there are better options like the one I found.

No. It is known that PDE/FDM is inherently sequential. BTW ADE does not use LU (nor cyclic reduction) and it still has bad speedup. I did it in OpenMP, so speedup is also < 2 at best. Try it yourself and see.

Quote:
    That's wrong reasoning about causality

Nope. What is the serial fraction of FDM? That will give the answer.
Last edited by Cuchulainn on April 23rd, 2015, 10:00 pm, edited 1 time in total.
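The "serial fraction" question can be made concrete with Amdahl's law. The sketch below is only an illustration: the function name `amdahl_speedup` is made up, and any serial fraction passed to it is a hypothetical input, not a measurement of an FDM code from this thread.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup with n workers when a
// fraction `serial` (between 0 and 1) of the run time cannot be
// parallelised. The remaining fraction is assumed to scale perfectly.
double amdahl_speedup(double serial, int n)
{
    assert(serial >= 0.0 && serial <= 1.0 && n >= 1);
    return 1.0 / (serial + (1.0 - serial) / n);
}
```

For example, with a purely hypothetical serial fraction of 0.5, `amdahl_speedup(0.5, 4000)` stays just below 2, and no number of cores pushes it past 1 / 0.5 = 2.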
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 24th, 2015, 11:59 am

OK, here's a good example: parallelise the EXPLICIT EULER method.
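As a starting point for that exercise, here is a minimal sketch of explicit Euler in time with central differences in space. The test problem (the 1D heat equation u_t = u_xx on [0,1] with zero boundary values and initial condition sin(pi x)), the function name `explicit_euler_heat` and its parameters are assumptions chosen for illustration; the post does not name a PDE. The OpenMP pragma is guarded so the code also builds without OpenMP.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Explicit Euler (FTCS) for u_t = u_xx on [0,1], u(0,t) = u(1,t) = 0,
// u(x,0) = sin(pi x). ns = number of space intervals, nt = time steps.
// The scheme is stable only if lambda = dt / dx^2 <= 0.5.
std::vector<double> explicit_euler_heat(int ns, int nt, double dt)
{
    assert(ns >= 2 && nt >= 0);
    const double pi = std::acos(-1.0);
    const double dx = 1.0 / ns;
    const double lambda = dt / (dx * dx);
    assert(lambda <= 0.5);

    // Boundary entries start at zero in both buffers and are never written.
    std::vector<double> u(ns + 1, 0.0), next(ns + 1, 0.0);
    for (int j = 1; j < ns; ++j)
        u[j] = std::sin(pi * dx * j);

    // Time loop: level n+1 needs level n, so this loop stays sequential.
    for (int n = 0; n < nt; ++n) {
        // Space loop: every j reads only the old level, so the
        // iterations are independent of each other.
#ifdef _OPENMP
#pragma omp parallel for
#endif
        for (int j = 1; j < ns; ++j)
            next[j] = u[j] + lambda * (u[j + 1] - 2.0 * u[j] + u[j - 1]);
        u.swap(next);
    }
    return u;
}
```

Only the inner loop offers parallel work, and with NS in the hundreds each step is a few hundred multiply-adds, so thread start-up and synchronisation can easily cost more than the arithmetic they distribute.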
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 24th, 2015, 12:21 pm

Quote, originally posted by outrun:
    here is a CUDA Tridiagonal solver using "Cyclic Reduction". Would be cool to see if that actually works.

Here's an article on the same.

I notice the matrices are very large and not realistic; in many cases NS in [300, 500] is fine.
 
ExSan
Posts: 493
Joined: April 12th, 2003, 10:40 am

sample cuda problems in finance

April 24th, 2015, 12:37 pm

Quote, originally posted by Cuchulainn:
    I copied some AMP code and compiled. Can anyone say whether this is so terribly bad and why? Of course, CUDA is closer to the hardware so it is more efficient bla bla bla :D We know all that.

I copy-pasted your code and it runs; the output is as predicted. I am trying to upload the executable to the FileShare, but the file.exe does not run, it gets stuck. Do you experience the same problem?
°°° About ExSan bit.ly/3U5bIdq °°°
 
ashkar
Posts: 0
Joined: October 17th, 2011, 9:25 am

sample cuda problems in finance

April 24th, 2015, 3:52 pm

Quote, originally posted by Cuchulainn:
    ADI is inherently sequential, as ashkar's data show.

Quote, originally posted by outrun:
    That's wrong reasoning about causality. The data doesn't *show* that it is inherently sequential; he (maybe?) used a sequential tridiagonal solver, for which there are better options like the one I found.

I use the cuSPARSE tridiagonal solver, which uses the cyclic reduction algorithm. However, in that cc version the implementation is less efficient. From what I've read the implementation has been improved in newer versions, but even then I don't expect to see much more efficiency.

Outrun, maybe I'm missing something. Have you actually run a tridiagonal solver on CUDA and seen much greater efficiency? Don't forget to include the cost of copying memory to make it realistic.
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 24th, 2015, 6:09 pm

Quote, originally posted by Cuchulainn:
    I copied some AMP code and compiled. Can anyone say whether this is so terribly bad and why? Of course, CUDA is closer to the hardware so it is more efficient bla bla bla :D We know all that.

Quote, originally posted by ExSan:
    I copy-pasted your code and it runs; the output is as predicted. I am trying to upload the executable to the FileShare, but the file.exe does not run, it gets stuck. Do you experience the same problem?

No, I'm a CUDA newbie :D What's FileShare?
Last edited by Cuchulainn on April 23rd, 2015, 10:00 pm, edited 1 time in total.
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 24th, 2015, 6:13 pm

Quote:
    that's why I didn't bother testing ADI on CUDA... I'll try Boost ode in the next few days as per your suggestion.

A variation of the anchoring FDM (2 factor x,y and t) is to do semi-discretization in x and y (MOL, thus) to get an ODE and then plug it into odeint/CUDA.

I can give you the code so that you don't have to worry about the x,y part. http://www.wilmott.com/messageview.cfm? ... SGDBTABLE=
Last edited by Cuchulainn on April 23rd, 2015, 10:00 pm, edited 1 time in total.
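To illustrate the method-of-lines idea (discretise x and y, keep t continuous, hand the resulting ODE system to a stepper), here is a rough sketch. It is not the code offered in the post above. The PDE (2D heat equation u_t = u_xx + u_yy with fixed boundary values), the names `MolRhs` and `euler_step`, and the hand-written Euler step standing in for a library integrator are all assumptions. The functor takes (state, derivative, time), which to my understanding is the shape Boost odeint expects of a system function, so it should be straightforward to pass to an odeint stepper instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using State = std::vector<double>;

// Semi-discretised right-hand side du/dt = u_xx + u_yy on an n x n grid
// with spacing h, stored row-major as u[i * n + j]. Boundary values are
// held fixed (du/dt = 0 there). The heat equation is a stand-in PDE.
struct MolRhs {
    std::size_t n;
    double h;

    void operator()(const State& u, State& dudt, double /*t*/) const
    {
        assert(u.size() == n * n);
        dudt.assign(n * n, 0.0);
        const double inv_h2 = 1.0 / (h * h);
        for (std::size_t i = 1; i + 1 < n; ++i)
            for (std::size_t j = 1; j + 1 < n; ++j) {
                const std::size_t k = i * n + j;
                // Five-point central-difference Laplacian.
                dudt[k] = (u[k - n] + u[k + n] + u[k - 1] + u[k + 1]
                           - 4.0 * u[k]) * inv_h2;
            }
    }
};

// One explicit Euler step, used here only as a placeholder for a
// library stepper.
void euler_step(const MolRhs& rhs, State& u, double t, double dt)
{
    State dudt;
    rhs(u, dudt, t);
    for (std::size_t k = 0; k < u.size(); ++k)
        u[k] += dt * dudt[k];
}
```

The right-hand side is a pure stencil evaluation with no dependency between grid points, which is the part that maps naturally onto a GPU; the time stepping around it remains a sequential loop.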
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 25th, 2015, 8:03 am

Quote, originally posted by Cuchulainn:
    ADI is inherently sequential, as ashkar's data show.

Quote, originally posted by outrun:
    That's wrong reasoning about causality. The data doesn't *show* that it is inherently sequential; he (maybe?) used a sequential tridiagonal solver, for which there are better options like the one I found.

Quote, originally posted by ashkar:
    I use the cuSPARSE tridiagonal solver, which uses the cyclic reduction algorithm. However, in that cc version the implementation is less efficient. From what I've read the implementation has been improved in newer versions, but even then I don't expect to see much more efficiency.
    Outrun, maybe I'm missing something. Have you actually run a tridiagonal solver on CUDA and seen much greater efficiency? Don't forget to include the cost of copying memory to make it realistic.

For PDE in computational finance you make sure you have a good scheme (NS ~ 400, NT ~ 300, whatever), which means the problem is too small to parallelise. A value like NS ~ 2000 means the scheme is fundamentally wrong. I really don't believe using the cyclic reduction algorithm will help greatly. Ashkar has proved it.

If your region of integration is the Atlantic Ocean then the situation takes a different form.
Last edited by Cuchulainn on April 24th, 2015, 10:00 pm, edited 1 time in total.
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 25th, 2015, 8:05 am

Quote, originally posted by outrun:
    here is a CUDA Tridiagonal solver using "Cyclic Reduction". Would be cool to see if that actually works.

Quote, originally posted by Cuchulainn:
    Here's an article on the same. I notice the matrices are very large and not realistic; in many cases NS in [300, 500] is fine.

Quote, originally posted by outrun:
    nice article!

Indeed. But the _speedup_ conclusions are disappointing.
 
Cuchulainn
Posts: 20252
Joined: July 16th, 2004, 7:38 am
Location: 20, 000

sample cuda problems in finance

April 25th, 2015, 12:42 pm

Quote, originally posted by outrun:
    here is a CUDA Tridiagonal solver using "Cyclic Reduction". Would be cool to see if that actually works.

Quote, originally posted by Cuchulainn:
    Here's an article on the same. I notice the matrices are very large and not realistic; in many cases NS in [300, 500] is fine.

Quote, originally posted by outrun:
    nice article!

Quote, originally posted by Cuchulainn:
    Indeed. But the _speedup_ conclusions are disappointing.

Quote, originally posted by outrun:
    the bottleneck seems to be "solving systems", e.g. I expect linear speedup (as a function of cores) for explicit methods or for simple BLAS-based linear algebra.
    The thing that's missing is how it scales as a function of cores. A single GPU core is slower than a CPU core, so we can't say much about scalability by comparing 1 CPU core against 480 GPU cores. What if we buy a more top-of-the-line GPU with 4000 cores, will it perform the same? Or 10x faster, or 3x faster?
    Another point is that you need to adapt the solution method to the hardware: the spec "I want 6 significant digits within 1 sec, I don't care how you do it" is more realistic.

I don't agree. PDE are not conducive to parallelization. BLAS is not the bottleneck. PDE/FDM is inherently sequential. The speedup is awful.

The human gestation period is nine months. It is not possible to reduce it to one month by using nine mothers. Even with 4000 cores.
Last edited by Cuchulainn on April 24th, 2015, 10:00 pm, edited 1 time in total.