November 9th, 2010, 11:18 am
Quote (originally posted by renorm): Anything which has to store and move tons of data won't speed up on GPU. Multifactor Longstaff-Schwartz is one such example. The Longstaff-Schwartz method requires tons of memory to store paths and significant CPU time to perform linear regression at every time step. Matrix algebra doesn't easily benefit from multithreading, but SIMD can be applied even to small matrices with good results.

This is not quite true. The GPU is specifically designed for processing huge quantities of data (originally for real-time, high-resolution 3D rendering). The GPU uses GDDR5 RAM as device memory, and can have up to 6GB of it. The CPU, on the other hand, uses DDR3 memory. The difference is that GDDR5 memory has a peak bandwidth of over 225GB/s, while the fastest DDR3 RAM has a peak bandwidth of about 25GB/s. It is also easier to write GPU code that achieves close to peak bandwidth than CPU code, because the GPU automatically hides memory access latency and gives the programmer direct control over the movement of data through the memory hierarchy. So the real GPU versus CPU memory performance difference is potentially higher than 10x.

The catch is that GDDR5 RAM is optimized for reading and writing large contiguous blocks of memory in parallel, so your algorithm's memory access patterns must also read and write large contiguous blocks in parallel. This should be possible in the Longstaff-Schwartz method, unless I am missing something: store the simulated paths so that, at each time step, neighboring threads touch neighboring path values (see the first sketch below).

As for the linear regression step, you are correct that matrix algebra doesn't easily benefit from CPU multi-threading, but this does not necessarily hold true for GPU "threads." GPU "threads" get mapped onto a grid of SIMD vector processors, so whatever performance gain is possible with a traditional SIMD version of matrix multiply should also be possible with GPU threads, if I'm not mistaken. The GPU also lets you precisely control the movement of data between GDDR5 device memory and on-chip shared memory and registers, so my intuition is that the same SIMD logic could be optimized even further on the GPU (see the second sketch below).
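
To make the coalesced-access point concrete, here is a minimal CUDA sketch of how one might store and advance the Monte Carlo paths. The layout, the parameter names, and the assumption that the normal increments dW are already on the device are my own illustration, not any particular Longstaff-Schwartz implementation. Storing the path matrix time-major, S[t * numPaths + p], means that at every step the threads of a warp read and write consecutive floats, which is exactly the large contiguous access pattern GDDR5 wants.

#include <cuda_runtime.h>
#include <math.h>

// One thread per path. S holds (numSteps + 1) * numPaths floats, laid out
// time-major so that consecutive threads access consecutive memory
// locations (coalesced reads and writes of contiguous blocks).
__global__ void gbmPaths(float* S, const float* dW,
                         int numPaths, int numSteps,
                         float S0, float mu, float sigma, float dt)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPaths) return;

    float s = S0;
    S[p] = s;                                   // step 0: coalesced write
    for (int t = 1; t <= numSteps; ++t) {
        float dw = dW[(t - 1) * numPaths + p];  // coalesced read
        s *= expf((mu - 0.5f * sigma * sigma) * dt
                  + sigma * sqrtf(dt) * dw);
        S[t * numPaths + p] = s;                // coalesced write
    }
}

// Illustrative launch: 256 threads per block, one thread per path.
// gbmPaths<<<(numPaths + 255) / 256, 256>>>(dS, dW, numPaths, numSteps,
//                                           100.0f, 0.05f, 0.2f, 1.0f / 252);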
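
And on the matrix algebra side, this is the standard pattern I had in mind: the classic shared-memory tiled matrix multiply. It is a generic C = A * B sketch (square N x N row-major matrices, a 16x16 tile size of my choosing), not anything specific to the regression in Longstaff-Schwartz, but it shows how each thread block stages tiles into on-chip shared memory so that every GDDR5 load is coalesced and each loaded value is reused TILE times. That is the same blocking idea as a cache-aware SIMD multiply on the CPU, only with explicit control over where the data sits.

#define TILE 16

// C = A * B for N x N row-major matrices. Each 16x16 thread block loads one
// tile of A and one tile of B into shared memory (coalesced loads), then
// computes on the on-chip copies, reusing each loaded value TILE times.
__global__ void matMulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] = (row < N && k0 + threadIdx.x < N)
                                     ? A[row * N + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (k0 + threadIdx.y < N && col < N)
                                     ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();               // tile fully staged in shared memory

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with this tile
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}

// Illustrative launch:
// dim3 block(TILE, TILE);
// dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
// matMulTiled<<<grid, block>>>(dA, dB, dC, N);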