October 11th, 2007, 5:11 pm
> I read somewhere that Jack Dongarra claimed speed up of 10 even with LU and for eigenvalue problems. It seems almost too good to be true but it might be worth a try.

Dongarra used iterative methods in mixed precision, not purely in single precision as I suggest. Dongarra's idea is to implement iterative linear algebra methods by doing most steps of the iteration in single precision and the refinement in double precision (the first sketch below illustrates the pattern). This works for LU factorizations. Eigenvalue problems in finance are often unstable even in double precision because of the so-called pseudo-spectrum problem, but if they are stable then mixed precision works there as well.

What I prefer is to use fast exponentiation of the linear propagator instead, starting from a time step of around one hour (the second sketch below). This method is stable in single precision [no need for mixed precision at all]. The speed-up factor is larger because double precision floating point operations on GPUs are about 10 times slower than their single precision equivalents. So I observe a factor 11-15 speed-up on real-life applications, not test problems [and this includes a fair amount of double precision work, like the calculation of implied volatilities]. Even so, sensitivities computed with fast exponentiation in single precision are far smoother than sensitivities computed with implicit methods and LU factorizations in either double or single precision. That's because the underlying elementary time step is so small. Smooth sensitivities and single precision stability either come together or not at all.

> Many peoples use double because they are scared of round-off errors.

Let me share my own interpretation of why everyone is now using double precision. In the 60s and 70s people knew of fast exponentiation, but it remained an academic topic because to implement the algorithm one needs to store full matrices and multiply them. Early computers did not have enough memory even to store large full matrices, let alone the parallelism to multiply them efficiently. So people turned to sparse matrices and learned how to invert them with iterative methods. In finance, you read about PDE methods that tell you to take a payoff, store it as a vector, and do backward induction with weekly time intervals using implicit methods. LU factorizations require either double or mixed precision. So Intel produced processors that run faster in double than in single: single precision floating point numbers have to be cast to double in microcode, and that takes time. Even with the MKL, single precision matrix multiplication runs about twice as slowly as double precision multiplication. In a sense there was historically a tradeoff between memory and floating point precision: since memory was scarce, double precision became the standard, and it was needed to support sparse matrix methods with their inherent instabilities.

GPUs come to us from the graphics market and are optimized for ray tracing algorithms, which are just fine with 32-bit floats. The transistors are arranged in a totally different way than in traditional CPUs. GPU cards also have plenty of memory, so one can multiply large full matrices. There is no need for sparsity patterns, no need to take elementary time steps beyond the stability threshold for explicit methods (which is around a few hours), and no need to program in low-level languages like C/C++ and push the optimization. If the bottleneck is on the GPU, the difference between running a program through the debugger and running an optimized version is a marginal 10%.
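To make the mixed precision idea concrete, here is a minimal sketch of Dongarra-style iterative refinement for a dense solve A x = b in Python/NumPy/SciPy: the expensive LU factorization and the triangular solves run in single precision, and only the cheap residual loop runs in double. The function name, tolerance and test matrix are mine, purely for illustration.

```python
# Minimal sketch of mixed-precision iterative refinement (illustrative only):
# factorize and back-substitute in float32, refine the residual in float64.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve A x = b using a single precision LU factorization,
    refined iteratively in double precision."""
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                       # O(n^3) work done in float32
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                              # residual computed in float64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # correction solved cheaply with the cached float32 factors
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
    b = rng.standard_normal(n)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(A @ x - b))                  # near double precision accuracy
```

For well-conditioned systems a handful of refinement steps recovers double precision accuracy while the O(n^3) work stays in single precision, which is where the reported speed-up comes from.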
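And here is a minimal sketch of the fast exponentiation approach itself, under strong simplifying assumptions: a driftless 1-D diffusion on a uniform lattice and a toy payoff, not the interest-rate model I actually use. The point is the mechanics: build the elementary explicit step P = I + dt*L with dt below the stability threshold, then square it repeatedly, all in float32.

```python
# Minimal sketch of fast exponentiation of an explicit propagator in float32.
# Model, lattice size and payoff are placeholders for illustration.
import numpy as np

DTYPE = np.float32                       # the scheme runs entirely in single precision

def generator(n, sigma, dx):
    """Tridiagonal finite-difference generator of a driftless diffusion."""
    L = np.zeros((n, n), dtype=DTYPE)
    c = (sigma / dx) ** 2 / 2.0
    for i in range(1, n - 1):
        L[i, i - 1] = c
        L[i, i] = -2.0 * c
        L[i, i + 1] = c
    return L                             # boundary rows left at zero (frozen boundaries)

def propagator(L, t_total, dt_max):
    """Approximate exp(t_total * L): pick an elementary step dt = t_total / 2**k
    no larger than dt_max (the explicit stability threshold), then square
    P = I + dt * L exactly k times to reach the full horizon."""
    k = int(np.ceil(np.log2(t_total / dt_max)))
    dt = t_total / 2 ** k
    P = np.eye(L.shape[0], dtype=DTYPE) + DTYPE(dt) * L
    for _ in range(k):
        P = P @ P                        # each squaring doubles the time horizon
    return P

if __name__ == "__main__":
    n, sigma, dx = 128, 0.2, 0.05
    dt_max = 1.0 / (365.0 * 24.0)        # roughly one hour, in years
    P = propagator(generator(n, sigma, dx), 1.0, dt_max)   # one-year transition matrix
    payoff = np.maximum(np.linspace(-3.2, 3.2, n, dtype=DTYPE), DTYPE(0))
    value = P @ payoff                   # backward induction collapses to one matrix-vector product
    print(value.dtype, value[n // 2])
```

This is also why full-matrix multiplication on the GPU pays off: reaching a horizon of 2^k elementary steps costs only k dense matrix products instead of 2^k sparse backward-induction sweeps.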
All one needs to do is manage the CPU-GPU bus wisely, as that becomes the bottleneck. The best way to tell whether an algorithm is stable in single precision is to set yourself up with the option of choosing the floating point unit (CPU or GPU) and the floating point precision (32 or 64 bit) dynamically, and then try it out (the last sketch below shows the pattern).

If one uses fast exponentiation, the error on implied swaption vols is less than one basis point in vol for strikes from 2% below the forward to 15% above it, and for all tenors/maturities up to 30Y into 20Y. Speed-up factors for single precision are around 11-15. It takes 7.62 seconds to price 585 swaptions (including obtaining all the discount factors etc.). Perhaps more surprisingly, it takes 11.34 seconds to price 30,240 swaptions: scaling is highly non-linear because of the massive parallelism in the card. This is with a lattice of 672 sites and a time step of one hour.

On the theoretical side, I proved some theorems in the paper I mentioned earlier. As a rule, if sensitivities are smooth, single precision is just fine. If not, single precision doesn't work, and when it fails, the failure typically has quite disastrous and visible effects.
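Finally, a minimal sketch of the "switch the precision at run time and try it" test, again in Python/NumPy. The pricer here (toy_pricer) is a hypothetical stand-in for whatever lattice pricer is under test; swapping NumPy for a GPU array library with the same interface extends the same switch to the CPU-versus-GPU choice.

```python
# Minimal sketch of a run-time precision switch: run the same pricer in
# float32 and float64 and compare. toy_pricer is a hypothetical stand-in.
import numpy as np

def toy_pricer(spots, strike, dtype=np.float64):
    """Stand-in for the real pricer: anything accepting a dtype switch fits."""
    spots = np.asarray(spots, dtype=dtype)
    return np.maximum(spots - dtype(strike), dtype(0)).mean()

def precision_check(pricer, *args, tol=1e-4, **kwargs):
    """Run the same calculation in 32- and 64-bit and report the gap."""
    v32 = np.asarray(pricer(*args, dtype=np.float32, **kwargs), dtype=np.float64)
    v64 = np.asarray(pricer(*args, dtype=np.float64, **kwargs), dtype=np.float64)
    gap = float(np.max(np.abs(v32 - v64)))
    return gap, bool(gap < tol)          # small gap: single precision is adequate here

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spots = 100.0 * np.exp(0.2 * rng.standard_normal(10_000))
    gap, ok = precision_check(toy_pricer, spots, 100.0)
    print(f"|float32 - float64| = {gap:.2e}, single precision adequate: {ok}")
```

If the 32-bit and 64-bit runs disagree beyond the tolerance you care about (a basis point of vol, say), that is the instability showing up; as noted above, when single precision fails, it tends to fail visibly.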