November 9th, 2010, 11:18 am
Quote (originally posted by renorm): Anything which has to store and move tons of data won't speed up on GPU. Multifactor Longstaff-Schwartz is one such example. The Longstaff-Schwartz method requires tons of memory to store paths and significant CPU time to perform linear regression at every time step. Matrix algebra doesn't easily benefit from multithreading, but SIMD can be applied even to small matrices with good results.

This is not quite true. The GPU is specifically designed for processing huge quantities of data (originally for real-time, high-resolution 3D rendering). The GPU uses GDDR5 RAM as device memory, and can have up to 6GB of it. The CPU, on the other hand, uses DDR3 memory. The difference is that GDDR5 memory has a peak bandwidth of over 225GB/s, while the fastest DDR3 RAM has a peak bandwidth of about 25GB/s. It is also easier to write GPU code that achieves close to peak bandwidth than CPU code, because the GPU automatically hides memory access latency and gives the programmer direct control over the movement of data through the memory hierarchy. So the real GPU versus CPU memory performance difference is potentially higher than 10x.

The catch is that GDDR5 RAM is optimized for reading and writing large contiguous blocks of memory in parallel, so your algorithm's memory access patterns must also read and write large contiguous blocks in parallel. This should be possible in the Longstaff-Schwartz method, unless I am missing something: store the simulated paths so that, at each time step, neighboring threads touch neighboring path values (see the first sketch below).

As for the linear regression step, you are correct that matrix algebra doesn't easily benefit from CPU multi-threading, but this does not necessarily hold true for GPU "threads." GPU "threads" get mapped onto a grid of SIMD vector processors, so whatever performance gain is possible with a traditional SIMD version of matrix multiply should also be possible with GPU threads, if I'm not mistaken. The GPU also lets you precisely control the movement of data between GDDR5 device memory and on-chip shared memory and registers, so my intuition is that the same SIMD logic could be optimized even further on the GPU (see the second sketch below).
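
To make the coalesced-access point concrete, here is a minimal CUDA sketch of how one might store and advance the Monte Carlo paths. The layout, the parameter names, and the assumption that the normal increments dW are already on the device are my own illustration, not any particular Longstaff-Schwartz implementation. Storing the path matrix time-major, S[t * numPaths + p], means that at every step the threads of a warp read and write consecutive floats, which is exactly the large contiguous access pattern GDDR5 wants.

#include <cuda_runtime.h>
#include <math.h>

// One thread per path. S holds (numSteps + 1) * numPaths floats, laid out
// time-major so that consecutive threads access consecutive memory
// locations (coalesced reads and writes of contiguous blocks).
__global__ void gbmPaths(float* S, const float* dW,
                         int numPaths, int numSteps,
                         float S0, float mu, float sigma, float dt)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPaths) return;

    float s = S0;
    S[p] = s;                                   // step 0: coalesced write
    for (int t = 1; t <= numSteps; ++t) {
        float dw = dW[(t - 1) * numPaths + p];  // coalesced read
        s *= expf((mu - 0.5f * sigma * sigma) * dt
                  + sigma * sqrtf(dt) * dw);
        S[t * numPaths + p] = s;                // coalesced write
    }
}

// Illustrative launch: 256 threads per block, one thread per path.
// gbmPaths<<<(numPaths + 255) / 256, 256>>>(dS, dW, numPaths, numSteps,
//                                           100.0f, 0.05f, 0.2f, 1.0f / 252);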
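
And on the matrix algebra side, this is the standard pattern I had in mind: the classic shared-memory tiled matrix multiply. It is a generic C = A * B sketch (square N x N row-major matrices, a 16x16 tile size of my choosing), not anything specific to the regression in Longstaff-Schwartz, but it shows how each thread block stages tiles into on-chip shared memory so that every GDDR5 load is coalesced and each loaded value is reused TILE times. That is the same blocking idea as a cache-aware SIMD multiply on the CPU, only with explicit control over where the data sits.

#define TILE 16

// C = A * B for N x N row-major matrices. Each 16x16 thread block loads one
// tile of A and one tile of B into shared memory (coalesced loads), then
// computes on the on-chip copies, reusing each loaded value TILE times.
__global__ void matMulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] = (row < N && k0 + threadIdx.x < N)
                                     ? A[row * N + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (k0 + threadIdx.y < N && col < N)
                                     ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();               // tile fully staged in shared memory

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with this tile
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}

// Illustrative launch:
// dim3 block(TILE, TILE);
// dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
// matMulTiled<<<grid, block>>>(dA, dB, dC, N);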