Originally posted by: Cuchulainn
GPU and CUDA are useful for embarrassingly parallel problems such as MC/MLMC and so on. That is well known. What is less well known, maybe, is that it is less efficient for non-SPMD problems. To take an example, tests on 2-factor PDE/ADE problems can have a speedup that depends on the sizes of NX, NY and NT. For small values the speedup can be 50, while for larger values it can converge to a value like 3 (or 1.3). In this regard OpenMP is more suitable for PDE/ADE.

Originally posted by: outrun
That's strange, I would expect the GPU to be much faster for PDEs. Where did you get that low 1.3 value? How many cores was that on, and did you use GPU-optimized linear algebra libs?

Originally posted by: Cuchulainn
The results will be made public domain soon. Not strange at all, it's what I expected. And AFAIK no one has done it; otherwise they would be shouting it from the rooftops? PDEs in finance are too small. It's got nothing to do with those linear algebra things (that's not the issue); PDE is intrinsically sequential (hint: compute the serial fraction of a typical FD scheme, just take explicit Euler with NT = 10000, NX = 50; not a matrix in sight, but no point parallelizing. QED). I would not want to spend too much time on PDE and GPU. Maybe combining OpenMP and GPU is an option. As mentioned, OpenMP for ADE gives a speedup of 5 on an 8-core machine without effort!!

Originally posted by: outrun
But how can you speed up sequential code with OpenMP in a way that can't be done on a GPU? There *have* to be parallel elements in order to have a case for OpenMP, and if so, then a GPU probably has some hardware extras that regular CPUs scattered across multiple machines just don't have (like fast shared memory)... post the paper when it's done! Ok?

Originally posted by: Cuchulainn
To answer this question you first have to know how ADE works. A back-of-envelope calculation, too.

Originally posted by: outrun
If it's sequential then there is no gain in adding processing units/cores, right? If it *is* parallelizable, then there is nothing magical that would allow OpenMP to process bits in parallel but would somehow not be possible on a GPU. If OpenMP can split a job into 10 bits that can be executed in parallel, which would give a 10-fold speedup, then so should a GPU.

Isn't the problem of parallelizing these PDEs tied to the topology of the mesh and the need for each core to access values in mesh-adjacent cores? I get the sense that the core-to-core accessibility and bandwidth inside a GPU is particularly poor.
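
Cuchulainn's back-of-envelope hint can be made concrete with Amdahl's law: if a fraction s of the work is serial (time stepping, synchronization, kernel launches), the speedup on P processors is bounded by 1/(s + (1 - s)/P). A minimal sketch; the serial fractions below are illustrative values I chose, not measurements from the thread:

#include <cstdio>

// Amdahl's law: upper bound on speedup for serial fraction s on P processors.
double amdahl(double s, double P) { return 1.0 / (s + (1.0 - s) / P); }

int main() {
    const double fracs[] = {0.01, 0.10, 0.30, 0.70};
    for (double s : fracs)
        std::printf("serial fraction %.2f: bound %6.2f on 1000 cores, %5.2f on 8 cores\n",
                    s, amdahl(s, 1000.0), amdahl(s, 8.0));
    return 0;
}

A serial fraction of 0.3 caps the speedup near 3.3 and a fraction of 0.7 near 1.4, no matter how many GPU cores you add; those bounds are in the same ballpark as the 3 and 1.3 figures quoted above, and no GPU-optimized library changes them.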
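To see where that serial fraction comes from in the explicit Euler example, here is a minimal sketch of my own, assuming the 1D heat equation u_t = u_xx with fixed boundary values (lambda and the initial condition are arbitrary choices; compile with -fopenmp to enable the pragma). The outer time loop cannot be parallelized because step n+1 needs the result of step n, and the inner loop offers only NX - 2 = 48 trivial updates per step, far too little work to amortize thread fork/join or GPU kernel launch overhead:

#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int NX = 50;          // spatial nodes, as in the thread
    const int NT = 10000;       // time steps, as in the thread; strictly sequential
    const double lambda = 0.4;  // dt/dx^2 < 0.5 for stability (my choice)

    std::vector<double> u(NX, 0.0), unew(NX, 0.0);
    u[NX / 2] = 1.0;            // arbitrary initial spike

    for (int n = 0; n < NT; ++n) {     // sequential: step n+1 depends on step n
        #pragma omp parallel for       // only 48 cheap updates to share out per step
        for (int i = 1; i < NX - 1; ++i)
            unew[i] = u[i] + lambda * (u[i + 1] - 2.0 * u[i] + u[i - 1]);
        std::swap(u, unew);            // boundaries stay at 0
    }
    std::printf("u at centre: %g\n", u[NX / 2]);
    return 0;
}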
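As to how ADE admits any parallelism at all: in the Barakat-Clark variant, each time step performs an upward and a downward Saul'yev sweep that are independent of each other, so the two sweeps can run on separate threads, and the final averaging loop is embarrassingly parallel. A minimal 1D sketch, again assuming the heat equation with fixed boundaries; ade_step and lambda are my names, and a real 2-factor scheme has more independent legs per step, which is presumably where the reported speedup of 5 on 8 cores comes from:

#include <vector>

// One ADE (Barakat-Clark) time step for u_t = u_xx, lambda = dt/dx^2.
// Boundary values u[0] and u[N-1] are held fixed.
void ade_step(std::vector<double>& u, double lambda) {
    const int N = static_cast<int>(u.size());
    std::vector<double> U = u, V = u;   // workspaces for the two sweeps

    #pragma omp parallel sections
    {
        #pragma omp section
        {   // upward Saul'yev sweep: sequential in i (U[i-1] is already updated),
            // but entirely independent of the downward sweep below
            for (int i = 1; i < N - 1; ++i)
                U[i] = (U[i] + lambda * (U[i + 1] - U[i] + U[i - 1])) / (1.0 + lambda);
        }
        #pragma omp section
        {   // downward Saul'yev sweep: V[i+1] is already updated
            for (int i = N - 2; i >= 1; --i)
                V[i] = (V[i] + lambda * (V[i + 1] - V[i] + V[i - 1])) / (1.0 + lambda);
        }
    }

    #pragma omp parallel for            // the averaging is embarrassingly parallel
    for (int i = 1; i < N - 1; ++i)
        u[i] = 0.5 * (U[i] + V[i]);
}

Note the catch, which bears on the mesh-adjacency question at the end of the thread: each sweep is a sequential recurrence (the new U[i] needs the new U[i-1]), so a single sweep cannot be spread across thousands of GPU threads, which may be why a handful of fat CPU cores under OpenMP suits this dependency pattern better.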