Hi guysI'm writing code for my uni project. I understand tridiagonal solvers are inherently serial in nature and best parallel algorithms offer O(logN) complexity in theory. However, using cuSparse library I only start to see around 1.3x speedup when the diagonal size is around 10k elements. In terms of implicit finite difference this is 10K space steps which is excessive. For dense matrix multiplication I see around 11x speedup.Does anyone have experience of writing such solvers in parallel (cuda or otherwise) and getting better speedup? Should I explore other ways of doing this e.g via eigenvalue/vectors to compute the inverse?Thx

QuoteOriginally posted by: ashkar Should I explore other ways of doing this e.g via eigenvalue/vectors to compute the inverse?Maybe -- what's the underlying problem?

Thanks for replying guys.Costeanu - that was a good link but notations are difficult to follow. It looks similar to reduction algorithms but on LU decomposed matrix. outrun - N-number of diagonal elements ~ Number of serial steps. You're right, my statement doesnt make sense. If I do some variants of forward elimination, this would be N steps even in parallel. The reduction stage can be LogN but overall it doesnt look like I can achieve much speedup compared to serial.Alan - My matrix is diag={1-2alpha(i)}, upperDiag=-alpha(i), lowerDiag=-alpha(i). This is for implicit FD solution for Black PDE (like in Wilmott's first book) with local vols thus the alpha(i) terms are spot and vol dependent and thus not constant. Is there any nice properties of such tridiag matrices that would make inversion computation less expensive? I could reformulate the solution in terms of new matrix diag={2,2,..2}, u/l diag={-1,...,-1} terms where i can directly compute the eigenvalues and vectors as described in section 3.4 of http://www.inf.ethz.ch/personal/gander/ ... cyclic.pdf (By the way what is this matrix called?)The problem is that my gpu card is pretty old and it has only 16 multiprocs with only one core per proc for special functions such as sin and cos.I'm doing a project on smile models (LV and Heston). I wanted to put an additional section on a parallel processing (using cuda) if it doesnt take too much time. (I have another 3 weeks before i submit my project). Is it reasonable to say that speedup is not expected in Implicit FD case especially given the overheads of the GPU? I do get very good speedup (11x) using my old video card on euler scheme which is simple mat-vec.

Let's say you are trying to solve du/dt = A u with A the described sparse array and u(0) given? (If your problem is not in this form, I would try to reformulate it so it is.)Then, in Mathematica, which I use, this would be quite tractable with quite large A, say size N x N = 10^4 x 10^4 you mention.You just give it to the built-in ODE system solver.Alternatively, as you mention, a spectral approach might work. However, there's no need to compute an inverse.Again in Mathematica or your favorite CAS, just ask for the first n (orthonormal) eigenvectors [$]v_n[$]/eigenvalues [$]\lambda_n[$] of A,where n is small (< 10). Then, see what kind of convergence and run-time you get with(*) [$] u(T) = \sum_{j=1}^n e^{\lambda_j T} (v_j, u(0)) \, v_j[$], where [$](a,b) = a \cdot b[$], the usual vector inner product. [The solution (*) with [$]n = N[$] is exact, and follows from the spectral theorem for (in this case), real, symmetric matrices]. You will want the first (or last) n eigenvalues, with the appropriate ordering so these leading terms are dominant.Sometimes, this kind of thing convergences extremely rapidly; it depends on T and the eigenvalue spacing.Indeed, the relative error in (*) is [$]O(e^{(\lambda_{n+1} - \lambda_1) T})[$], so the larger T, the better. So, to my mind, a conventional CAS approach should handle this kind of thing, making parallelization(or indeed writing any significant amount of code in any system at all) unnecessary ..

Last edited by Alan on August 7th, 2014, 10:00 pm, edited 1 time in total.

Thanks Alan Yes thats the matrix system I was trying to solve. Thats very interesting method, I didnt know about this method. I will try it in matlab. For academic purpose, it will be very easy to parallelise it aswell.

- Cuchulainn
**Posts:**62913**Joined:****Location:**Amsterdam-
**Contact:**

QuoteOriginally posted by: ashkarAlan - My matrix is diag={1-2alpha(i)}, upperDiag=-alpha(i), lowerDiag=-alpha(i). This is for implicit FD solution for Black PDE (like in Wilmott's first book) with local vols thus the alpha(i) terms are spot and vol dependent and thus not constant. Is there any nice properties of such tridiag matrices that would make inversion computation less expensive? I could reformulate the solution in terms of new matrix diag={2,2,..2}, u/l diag={-1,...,-1} terms where i can directly compute the eigenvalues and vectors as described in section 3.4 of http://www.inf.ethz.ch/personal/gander/ ... cyclic.pdf (By the way what is this matrix called?)The problem is that my gpu card is pretty old and it has only 16 multiprocs with only one core per proc for special functions such as sin and cos.I'm doing a project on smile models (LV and Heston). I wanted to put an additional section on a parallel processing (using cuda) if it doesnt take too much time. (I have another 3 weeks before i submit my project). Is it reasonable to say that speedup is not expected in Implicit FD case especially given the overheads of the GPU? I do get very good speedup (11x) using my old video card on euler scheme which is simple mat-vec.Maybe take a step backwards from hand-crafted and traditional solvers (for linear PDE only) and have a look at Boost ODEint and gpuWriting your own time stepper is kind of pointless in a sense; especially parallelising matrix LU is something to be give to a _specialised_ ODE solver NDSolve, odeint and NAG. See examples BTW implicit Euler is O(dt) accurate and L-stable while Crank-Nicolson is O(dt^2) but only A-stable. QuoteYou just give it to the built-in ODE system solver.With 3 weeks to go odeint might be a bridge too far.

Last edited by Cuchulainn on August 7th, 2014, 10:00 pm, edited 1 time in total.

Step over the gap, not into it. Watch the space between platform and train.

http://www.datasimfinancial.com

http://www.datasim.nl

http://www.datasimfinancial.com

http://www.datasim.nl

Thanks Cuch. I'll try that after the project as I dont have much time to change the code too much at this time. It'll be interesting to see the performance difference. I spent a lot of time yesterday debugging numerical differences in local vol and heston calculations on gpu v cpu! All my variables and constants are floats in msvc and nvcc (cuda compiler) and yet it seems like msvc computes everything as doubles and then casts the results back into floats. Switching the floating point setting in visual studio makes no difference. Eventually the differences are very large ~10^-4 between cuda and msvc implementation.E.g. f(x+dx), f(x-dx) and f(x) all show differences in 10th decimal place (through the debugger) in cuda and msvc. However the result of f(x+dx) + f(x-dx) -2*f(x) shows differences from the 4th decimal place onwards. If I break it down, then its the subtraction part that introduces the difference. At first, I thought it might be some kind of optimisation going wrong so I've tried various different ways to calculate the above but the results are always the same.Any experience with such kind of issues visual c++ compiler?

Thats an interesting article. I'll try that approach of setting a binary convenient dx term rather than 0.01.After reading further about it, it looks like the intermediate double calculations are due to the compiler storing variables and constants in registers. GPUs typically have thousands of registers so it must be storing all temps in registers. So unless the gpu and cpu have the same register width, its going to be very difficult to put an upper bound on the error. I wish I was more proficient with cuda to dig in deeper...maybe one for after submitting the project.

- Traden4Alpha
**Posts:**23951**Joined:**

You might also be able to show that some tridiagonals cannot be solved by GPUs due to the use of floats, not doubles.The trickier problem is in knowing a priori which numerical systems are well-conditioned enough to be solved with floats, not doubles.And if you want a really worth-while challenge, then see if there's a way to solve ill-conditioned problems with floats.

- Cuchulainn
**Posts:**62913**Joined:****Location:**Amsterdam-
**Contact:**

QuoteThe trickier problem is in knowing a priori which numerical systems are well-conditioned enough to be solved with floats, not doubles.The techniques for this are documented but it is lost in action I suppose because double is so pervasive. Why float??

Last edited by Cuchulainn on August 11th, 2014, 10:00 pm, edited 1 time in total.

Step over the gap, not into it. Watch the space between platform and train.

http://www.datasimfinancial.com

http://www.datasim.nl

http://www.datasimfinancial.com

http://www.datasim.nl

QuoteOriginally posted by: CuchulainnQuoteThe trickier problem is in knowing a priori which numerical systems are well-conditioned enough to be solved with floats, not doubles.The techniques for this are documented but it is lost in action I suppose because double is so pervasive. Why float??My gpu has one double precision unit per SM so i use floats everyone for a fair comparison. In general it will be useful to develop further understanding of this as even newer cards dont have full support for doubles.

- Cuchulainn
**Posts:**62913**Joined:****Location:**Amsterdam-
**Contact:**

QuoteOriginally posted by: ashkarQuoteOriginally posted by: CuchulainnQuoteThe trickier problem is in knowing a priori which numerical systems are well-conditioned enough to be solved with floats, not doubles.The techniques for this are documented but it is lost in action I suppose because double is so pervasive. Why float??My gpu has one double precision unit per SM so i use floats everyone for a fair comparison. In general it will be useful to develop further understanding of this as even newer cards dont have full support for doubles.I suppose you could carry out analysis like this approach but there again it might be overkill. A lost art.floating point error analysis

Last edited by Cuchulainn on August 11th, 2014, 10:00 pm, edited 1 time in total.

Step over the gap, not into it. Watch the space between platform and train.

http://www.datasimfinancial.com

http://www.datasim.nl

http://www.datasimfinancial.com

http://www.datasim.nl

GZIP: On