wilmott.com

hasmanean

Quote (originally posted by wwmch): Quote (originally posted by renorm): The language of CUDA is only a modified C (some features of C++ are also supported). It is not difficult if you are familiar with C/C++. However, if we only translate the C/C++ code to CUDA, the performance won't be good at all. The...

hasmanean

It's nice that thrust has such low overhead on the GPU, although the numbers for thrust::host_vector on the CPU look a little slow. Was the library compiled from source and optimized for your particular processor, or did it use a precompiled binary? Maybe if you turned on all the right optimizati...

hasmanean

Well, massive threading (massively parallel threading) plays a role in the performance analysis. With CUDA, even if 1% of the threads branch and 99% do not (and so finish their task quickly), threads are not run individually (as on a CPU); rather, up to 32 threads run at a time (known as a "...
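To put a number on the point above, here is a minimal sketch. It takes the 1% branch rate and the 32-thread group size from the post, and it assumes each thread branches independently of the others, which real data may not satisfy. The function name `group_divergence_probability` is made up for illustration.

```python
def group_divergence_probability(p_branch: float, group_size: int = 32) -> float:
    """Chance that at least one thread in a group takes the branch.

    Assumes every thread branches independently with probability p_branch
    (an assumption for this sketch, not something the post states).
    """
    return 1.0 - (1.0 - p_branch) ** group_size


if __name__ == "__main__":
    # 1% of threads branch, 32 threads execute together.
    print(f"{group_divergence_probability(0.01):.3f}")  # prints 0.275
```

So under that assumption, roughly a quarter of the 32-thread groups would contain at least one branching thread, even though only 1% of individual threads branch.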

hasmanean

IIRC, CUDA originally did not support if statements inside GPU kernels; this was added later, in version 2. As long as you access the data sequentially, and as long as there is enough register space available, there is enough leeway in how you process it to not affect performance to...

hasmanean

The two-pass algorithm works better, provided that the result of processing "illegal" inputs (which the branch identifies) does not affect the output. GPUs really only care that memory be accessed in a linear, sequential fashion, so to do that there are 128-bit wide datatypes defined lik...
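One way to picture a two-pass structure like this is sketched below. It is not the algorithm from the thread, which is not shown here: the arithmetic (`2 * x + 1`), the "illegal" test (`x < 0`), the replacement value and the name `two_pass` are all placeholders chosen for illustration. Plain Python only shows the shape of the idea and says nothing about GPU performance.

```python
def two_pass(values, fill=0.0):
    """Hypothetical two-pass sketch: compute everywhere, then patch.

    Pass 1 walks the data linearly and applies the same arithmetic to
    every element, with no branch on whether the input is "illegal".
    Pass 2 overwrites the results that came from illegal inputs.
    """
    # Pass 1: unconditional, sequential processing of every element.
    results = [2.0 * x + 1.0 for x in values]

    # Pass 2: patch only the entries whose input was illegal (here: negative).
    for i, x in enumerate(values):
        if x < 0.0:
            results[i] = fill
    return results


if __name__ == "__main__":
    print(two_pass([1.0, -3.0, 0.5]))  # prints [3.0, 0.0, 2.0]
```

This only works when, as the post says, processing an illegal input is harmless: the first pass must not fault or corrupt anything, because its output for those elements is thrown away.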

hasmanean

Small detail: I'm assuming that since the branch conditional is evaluated inside the loop, the data has to be loaded into general-purpose registers in scalar form first and packed into the vector (the SSE registers) manually. So for the vectorized case M > 1 it should read N*(1/M + P + (1 - 0.99^M)...

hasmanean

Have you looked at sparse matrices?

hasmanean

Back in the 1980s processors were slower than memory, so optimizing meant reducing the number of arithmetic operations you carried out. The bottleneck today is memory access latency, so the point is to optimize the data access patterns. Both GPUs and Intel processors can do arithmetic operations ...

Search found 8 matches

A short test on the code efficiency of CUDA and thrust

A short test on the code efficiency of CUDA and thrust

GPU vs SIMD

GPU vs SIMD

GPU vs SIMD

GPU vs SIMD

How to do fast element-wise array operations in C?

GPU vs SIMD