User avatar
AlexEro
Posts: 203
Joined: November 25th, 2014, 4:27 pm
Location: Ukraine
Contact:

sample cuda problems in finance

September 27th, 2015, 8:48 am

Quote, originally posted by Cuchulainn: "For the great unwashed like myself, could you explain all these numbers? On a Tesla C20270 and CUDA 5.5 we report speedups in the range [30,150] for Asian and lookback options using the MLMC method. From your chart CUDA 5.5 has quite a number of 'updates'/improvements?"

CUDA 5.5 is a milestone for distribution of your software:

1) It allows static CUDA libraries to be linked with your application, so the CUDA runtime DLL does not have to be shipped alongside your project's DLL or EXE.
2) It includes most of CUDA's powerful features and requires only a very old driver on the client's system to support them.
3) It is finally stable for multithreading (it has been since 4.0, but with some minor issues).

This is not about speed; compatibility between your system and the client's system is more important. If you link a project under CUDA 7.5, it will absolutely not run on a client's machine unless the client installs the latest 35x driver. Recent versions of CUDA such as 6.5 are about 10% faster, but you can face several driver and hardware incompatibilities on the client's side.
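For concreteness, a minimal sketch of how the driver/runtime mismatch can be detected on the client's machine before the first kernel launch. The calls cudaDriverGetVersion and cudaRuntimeGetVersion are standard CUDA runtime API; the messages and the decision to abort are illustrative only.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: check that the installed driver supports the CUDA runtime the binary was linked against.
int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA version this binary was built/linked against

    std::printf("driver supports CUDA %d, binary linked against CUDA %d\n",
                driverVersion, runtimeVersion);

    if (driverVersion < runtimeVersion)
    {
        std::printf("driver too old for this build -- ask the client to update the driver\n");
        return 1;
    }
    return 0;
}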
 
User avatar
Cuchulainn
Posts: 62391
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

sample cuda problems in finance

October 1st, 2015, 9:06 am

GPU and CUDA are useful for embarrassingly parallel problems such as MC/MLMC and so on; that is well known. What is perhaps less well known is that they are less efficient for non-SPMD problems. To take an example, tests on 2-factor PDE/ADE problems show a speedup that depends on the sizes of NX, NY, NT. For small values the speedup can be 50, while for larger values it can converge to a value like 3 (or 1.3). In this regard OpenMP is more suitable for PDE/ADE.
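A minimal sketch of what that OpenMP approach looks like, assuming a generic explicit finite-difference update (the arrays uOld/uNew and the coefficients a, b, c are placeholders, not the actual scheme): the time loop stays sequential, and only the loop over the spatial grid is shared among threads, which is why the gain depends on how large the spatial grid is relative to NT.

#include <omp.h>
#include <vector>

// Sketch: explicit time stepping with the spatial loop parallelised by OpenMP.
void timeStep(std::vector<double>& uOld, std::vector<double>& uNew,
              double a, double b, double c, int NT)
{
    const int NX = static_cast<int>(uOld.size());
    for (int n = 0; n < NT; ++n)            // time loop: inherently sequential
    {
        #pragma omp parallel for            // spatial loop: shared among the threads
        for (int i = 1; i < NX - 1; ++i)
            uNew[i] = a * uOld[i - 1] + b * uOld[i] + c * uOld[i + 1];

        uOld.swap(uNew);                    // advance to the next time level
    }
}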
Last edited by Cuchulainn on September 30th, 2015, 10:00 pm, edited 1 time in total.
 
User avatar
Cuchulainn
Posts: 62391
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

sample cuda problems in finance

October 1st, 2015, 11:48 am

Quote, originally posted by outrun (replying to Cuchulainn's post above): "That's strange, I would expect the GPU to be much faster for PDEs. Where did you get that low 1.3 value? How many cores was that on, and did you use GPU-optimized linear algebra libs?"

The results will be made public domain soon.

Not strange at all, it's what I expected. And AFAIK no one has done it; otherwise they would be shouting it from the rooftops. PDEs in finance are too small. It has nothing to do with the linear algebra libraries (that's not the issue); PDE is intrinsically sequential (hint: compute the serial fraction of a typical FD scheme, e.g. explicit Euler with NT = 10000, NX = 50 -- not a matrix in sight, but no point parallelizing. QED).

I would not want to spend too much time on PDE and GPU. Maybe combining OpenMP and GPU is an option. As mentioned, OpenMP for ADE gives a speedup of 5 on an 8-core machine without effort!
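To make the back-of-envelope argument concrete, here is an illustrative sketch (not code from the thread; the kernel, coefficients and launch configuration are placeholders) of explicit Euler on the GPU. Every one of the NT time steps must wait for the previous one, so the device only ever sees NX = 50 points of parallel work at a time, and the 10,000 sequential kernel launches (each with overhead on the order of microseconds) dominate the run time.

#include <utility>
#include <cuda_runtime.h>

// One explicit Euler step on the device; uOld/uNew and a, b, c are placeholders.
__global__ void eulerStep(const double* uOld, double* uNew,
                          double a, double b, double c, int NX)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < NX - 1)
        uNew[i] = a * uOld[i - 1] + b * uOld[i] + c * uOld[i + 1];
}

void solve(double* dOld, double* dNew, double a, double b, double c, int NX, int NT)
{
    // Time marching is intrinsically sequential: step n+1 needs the result of step n.
    // With NX = 50 there are only ~50 threads of useful work per launch, so launch
    // overhead times NT = 10000 swamps the arithmetic.
    for (int n = 0; n < NT; ++n)
    {
        eulerStep<<<1, 64>>>(dOld, dNew, a, b, c, NX);
        std::swap(dOld, dNew);   // reuse the two buffers for the next time level
    }
    cudaDeviceSynchronize();
}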
Last edited by Cuchulainn on September 30th, 2015, 10:00 pm, edited 1 time in total.
 
User avatar
Cuchulainn
Posts: 62391
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

sample cuda problems in finance

October 1st, 2015, 2:36 pm

Quote, originally posted by outrun (replying to Cuchulainn's post above): "But how can you speed up sequential code with OpenMP in a way that can't be done on a GPU? There *has* to be a parallel element in order to have a case for OpenMP, and if so then a GPU probably has some hardware extras that regular CPUs scattered across multiple machines just don't have (like fast shared memory)... post the paper when it's done! Ok?"

To answer this question you first have to know how ADE works. A back-of-envelope calculation, too.
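For context, a sketch of the ADE structure being alluded to, assuming a Barakat-Clark-style scheme (the coefficients and formula below are illustrative placeholders, not the scheme used in the thread's tests): each time step consists of an upward sweep and a downward sweep; each sweep is a recursion in space, hence sequential within itself, but the two sweeps are independent of each other and can run on two OpenMP threads.

#include <omp.h>
#include <vector>

// Sketch of one ADE time step: two independent sweeps, run as two OpenMP sections,
// then averaged. lambda is the usual dt/dx^2-type mesh ratio (placeholder).
void adeStep(const std::vector<double>& uPrev, std::vector<double>& uNext, double lambda)
{
    const int NX = static_cast<int>(uPrev.size());
    std::vector<double> U(uPrev), V(uPrev);

    #pragma omp parallel sections
    {
        #pragma omp section
        {   // upward sweep: i = 1 .. NX-2, uses the freshly computed U[i-1]
            for (int i = 1; i < NX - 1; ++i)
                U[i] = (uPrev[i] + lambda * (U[i - 1] - uPrev[i] + uPrev[i + 1])) / (1.0 + lambda);
        }
        #pragma omp section
        {   // downward sweep: i = NX-2 .. 1, uses the freshly computed V[i+1]
            for (int i = NX - 2; i >= 1; --i)
                V[i] = (uPrev[i] + lambda * (V[i + 1] - uPrev[i] + uPrev[i - 1])) / (1.0 + lambda);
        }
    }

    for (int i = 0; i < NX; ++i)
        uNext[i] = 0.5 * (U[i] + V[i]);   // average the two sweeps
}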
Last edited by Cuchulainn on September 30th, 2015, 10:00 pm, edited 1 time in total.
 
User avatar
Traden4Alpha
Posts: 23951
Joined: September 20th, 2002, 8:30 pm

sample cuda problems in finance

October 1st, 2015, 4:44 pm

Quote, originally posted by outrun (replying to Cuchulainn's post above): "If it's sequential then there is no gain in adding processing units/cores, right? If it *is* parallelizable, then there is nothing magical that would allow OpenMP to process bits in parallel but that would somehow not be possible on a GPU. If OpenMP can split a job into 10 bits that can be executed in parallel, which would give a 10-fold speedup, then so should a GPU."

Isn't the problem of parallelizing these PDEs tied to the topology of the mesh and the need for each core to access values in mesh-adjacent cores? I get the sense that the core-to-core accessibility and bandwidth inside a GPU are particularly poor.
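On the neighbour-access question: the usual GPU answer is to stage a tile of the mesh plus its halo in fast on-chip shared memory, as in this illustrative 1D three-point stencil kernel (tile size, names and coefficients are placeholders; blockDim.x is assumed equal to TILE).

#define TILE 128

// Illustrative only: each block loads its tile plus a one-point halo on each side
// into shared memory, so neighbour reads stay on chip instead of going to global memory.
__global__ void stencilStep(const double* uOld, double* uNew,
                            double a, double b, double c, int NX)
{
    __shared__ double tile[TILE + 2];

    int gi = blockIdx.x * blockDim.x + threadIdx.x;   // global mesh index
    int li = threadIdx.x + 1;                         // local index inside the tile

    if (gi < NX) tile[li] = uOld[gi];
    if (threadIdx.x == 0 && gi > 0)                   tile[0]        = uOld[gi - 1];  // left halo
    if (threadIdx.x == blockDim.x - 1 && gi < NX - 1) tile[TILE + 1] = uOld[gi + 1];  // right halo
    __syncthreads();

    if (gi > 0 && gi < NX - 1)
        uNew[gi] = a * tile[li - 1] + b * tile[li] + c * tile[li + 1];
}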
 
User avatar
Cuchulainn
Posts: 62391
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

sample cuda problems in finance

October 1st, 2015, 5:14 pm

In fairness, 1.3 is the worst case. For 2-factor CEV, speedups of [57,116] are realized. For 1-factor it is [5,12], but it all depends on the NY/NT ratio.
Last edited by Cuchulainn on September 30th, 2015, 10:00 pm, edited 1 time in total.
 
User avatar
Cuchulainn
Posts: 62391
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

sample cuda problems in finance

October 1st, 2015, 5:40 pm

Quote: "Isn't the problem of parallelizing these PDEs tied to the topology of the mesh and the need for each core to access values in mesh-adjacent cores? I get the sense that the core-to-core accessibility and bandwidth inside a GPU are particularly poor."

Yes. In particular, we can get false sharing when two threads modify data stored on the same cache line.
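A small illustration of the false-sharing point, assuming a 64-byte cache line (the struct, the padding and the loop count are illustrative only): two threads each update their own accumulator, yet without the alignas padding both accumulators would sit on one cache line and the threads would contend for it.

#include <omp.h>
#include <cstdio>

// Illustrative: per-thread accumulators padded onto separate cache lines.
// Removing alignas(64) puts both doubles on one line and triggers false sharing.
struct PaddedDouble
{
    alignas(64) double value = 0.0;   // one cache line per accumulator (assumes 64-byte lines)
};

int main()
{
    PaddedDouble acc[2];

    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 100000000; ++i)
        acc[omp_get_thread_num()].value += 1.0;   // each thread touches only its own slot

    std::printf("%f %f\n", acc[0].value, acc[1].value);
    return 0;
}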
 
User avatar
AlexEro
Posts: 203
Joined: November 25th, 2014, 4:27 pm
Location: Ukraine
Contact:

sample cuda problems in finance

October 13th, 2015, 1:09 am

Quote, originally posted by Cuchulainn (replying to Traden4Alpha above): "Yes. In particular, we can get false sharing when two threads modify data stored on the same cache line."

PCI-E bus transfers between host and GPU can kill any algorithmic speedup. The overhead is always 1-5 ms per cudaMemcpy, and cannot be reduced. Hello, Intel.
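A small way to measure that transfer overhead on one's own hardware rather than take the figure on faith (CUDA event timing is standard runtime API; the 1 MB buffer size and the use of pinned host memory are illustrative choices):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: time a single host->device cudaMemcpy with CUDA events.
int main()
{
    const size_t bytes = 1 << 20;          // 1 MB, illustrative
    double *hBuf = nullptr, *dBuf = nullptr;
    cudaMallocHost(&hBuf, bytes);          // pinned host memory
    cudaMalloc(&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("host->device copy of %zu bytes took %.3f ms\n", bytes, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return 0;
}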
 
User avatar
Cuchulainn
Posts: 62391
Joined: July 16th, 2004, 7:38 am
Location: Amsterdam
Contact:

sample cuda problems in finance

October 14th, 2015, 4:35 pm

Quote, originally posted by AlexEro: "PCI-E bus transfers between host and GPU can kill any algorithmic speedup. The overhead is always 1-5 ms per cudaMemcpy, and cannot be reduced. Hello, Intel."

For some problems (non-SPMD) the GPU is less than optimal, and then it is better to do everything on an n-core Intel. It seems that MC is a good candidate for GPU, but certainly not PDE.
 
User avatar
AlexEro
Posts: 203
Joined: November 25th, 2014, 4:27 pm
Location: Ukraine
Contact:

sample cuda problems in finance

October 15th, 2015, 3:46 am

Quote, originally posted by Cuchulainn: "For some problems (non-SPMD) the GPU is less than optimal, and then it is better to do everything on an n-core Intel. It seems that MC is a good candidate for GPU, but certainly not PDE."

I am not so sure about PDE or any other algo. There is always an algo that seems "weird" on a "sequential" host processor but will be amazingly fast on a parallel GPU. It is only a matter of how tricky your programming style is. To be precise: of how tricky your programming style COULD BE. The last stage of development on CUDA can be very SLOW, because you have to have a final fight with the GPU hardware.

I strongly recommend using the "nvprof" profiler program from the CUDA package. Use version 6.0 or higher (you have to install the proper video driver for CUDA 6.0, see the list of drivers above); it shows much more useful info. Typical .BAT files (3 of them) for the various modes of operation of nvprof:

nvprof.exe --print-api-trace --log-file CUDA-API.txt terminal.exe /portable

or

nvprof.exe --print-gpu-trace --log-file CUDA-Kernels.txt terminal.exe /portable

or

nvprof.exe terminal.exe /portable
(short on-screen info)

"terminal /portable" is your application with its command-line option (the MT4 trading terminal in this case).