
 
arifali
Topic Author
Posts: 0
Joined: June 14th, 2006, 10:40 am

Geman & Roncoroni Jump Processes

July 3rd, 2006, 11:55 am

Hi! Does anyone know how to implement a jump process in MATLAB? Specifically, I want to implement the Geman and Roncoroni model for electricity prices. I know that C. Albanese et al. have presented a numerical implementation of the model, but I am having trouble translating it into MATLAB. I would be ever so grateful if anyone could help me with this!

Regards,
Arifali Hirji
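[Editor's note: a minimal MATLAB sketch of simulating a mean-reverting log-price with compound-Poisson jumps, as a starting point for the question above. It is a skeleton only, not the full Geman-Roncoroni specification (which adds a seasonal trend and a level-dependent jump direction/intensity); all parameter values are made up for illustration.]

```matlab
% Euler scheme for a mean-reverting log-price with jumps (illustrative only)
T = 1; n = 365*24; dt = T/n;              % hourly grid over one year
alpha = 10; mu = log(40); sigma = 0.8;    % mean-reversion speed, level, vol
lambda = 20; muJ = 0.4; sigJ = 0.2;       % jump intensity (per year), jump sizes
x = zeros(n+1,1); x(1) = mu;              % log-price
for k = 1:n
    dN = (rand < lambda*dt);              % at most one jump per hour (approximation)
    J  = dN*(muJ + sigJ*randn);           % jump size if a jump occurred
    x(k+1) = x(k) + alpha*(mu - x(k))*dt + sigma*sqrt(dt)*randn + J;
end
S = exp(x);                               % simulated spot price path
plot((0:n)*dt, S)
```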
 
albanese
Posts: 0
Joined: August 30th, 2004, 8:16 am

Geman & Roncoroni Jump Processes

October 9th, 2007, 4:43 pm

Actually, MATLAB is a good platform for the implementation we propose. The key point is that one needs access to an efficient matrix-matrix multiplication routine, and MATLAB provides that for you.

I'd be interested to know why you find the implementation description difficult. Feel free to elaborate; by taking your comments into account we could improve the presentation.

Nowadays, the best (low-cost) hardware platform for implementing this model would also include hardware acceleration. From MATLAB you could, for instance, use CUDA with an NVIDIA card such as the GeForce 8800 or a Tesla. One important remark: linear fast exponentiation is stable in single precision. I know many may find this hard to believe. LU factorizations, implicit methods, and Pade approximants all require double precision, but linear fast exponentiation with tiny time discretization steps on the order of one hour is stable in single precision. The reason is related to an internal smoothing mechanism, which also has the side effect of producing very smooth sensitivities. If you are interested in the mathematics behind the smoothing, I'll point you to a review paper I just wrote at http://www.level3finance.com/abelian.pdf . For those not interested in the math, just try it: computing the matrix-matrix product in single precision gives speed-up factors of order 15. I am also in the process of writing a paper on swing options using the block-factorization algorithm in that same review paper.

There are other hardware platforms you may want to consider, such as the IBM Cell and the ClearSpeed boards.
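[Editor's note: a hedged MATLAB sketch of the "linear fast exponentiation" idea described above. A generator matrix L on an n-site price lattice and a payoff vector are assumed given (L, n, nsq and payoff are illustrative names, not code from the paper); the elementary one-hour transition matrix is propagated to the horizon by repeated squaring, so each step is a single matrix-matrix product.]

```matlab
dt  = 1/(365*24);                     % one hour in years, below the explicit
                                      % stability threshold
P   = single(full(eye(n) + dt*L));    % elementary explicit-step kernel, 32-bit
nsq = 13;                             % 2^13 hours is roughly 0.93 years
for i = 1:nsq
    P = P*P;                          % each squaring is dispatched to an optimized sgemm
end
V = double(P)*payoff;                 % price a payoff vector with one matrix-vector product
```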
 
Cuchulainn
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Geman & Roncoroni Jump Processes

October 9th, 2007, 6:08 pm

Quote: "LU factorizations, implicit methods, and Pade approximants all require double precision."

I read somewhere that Jack Dongarra claimed a speed-up of 10 even with LU and for eigenvalue problems. It seems almost too good to be true, but it might be worth a try.

Quote: "But linear fast exponentiation with tiny time discretization steps on the order of one hour is stable in single precision."

What is the effect of round-off errors (tiny steps ==> more calculations), and is there an a priori way to tell for which algorithms single precision is good enough? Many people use double because they are scared of round-off errors.
 
albanese
Posts: 0
Joined: August 30th, 2004, 8:16 am

Geman & Roncoroni Jump Processes

October 11th, 2007, 5:11 pm

> I read somewhere that Jack Dongarra claimed a speed-up of 10 even with LU and for eigenvalue problems. It seems almost too good to be true, but it might be worth a try.

Dongarra used iterative methods in mixed precision, not purely in single precision as I suggest. Dongarra's idea is to implement iterative linear algebra methods by using single precision for most steps in the iteration and double precision for the refinement. This works for LU factorizations. Eigenvalue problems in finance are often unstable even in double precision because of the so-called pseudo-spectrum problem, but if they are stable then mixed precision works there as well. What I prefer is to use linear fast exponentiation instead, starting from a time step of around one hour. This method is stable in single precision [no need for mixed precision at all]. The speed-up factor is larger because double-precision floating-point operations on GPUs are about 10 times slower than their single-precision equivalents, so I observe a factor 11-15 speed-up on real-life applications, not test problems [and this includes a lot of double-precision work such as the calculation of implied volatilities]. Moreover, sensitivities computed with fast exponentiation in single precision are far smoother than sensitivities computed with implicit methods and LU factorizations in either double or single precision. That's because the underlying elementary time step is so small. Smooth sensitivities and single-precision stability either come together, or neither does.

> Many people use double because they are scared of round-off errors.

Let me share my own interpretation of why everyone now uses double precision. In the 60s and 70s people knew about fast exponentiation, but it remained an academic topic because implementing the algorithm requires storing full matrices and multiplying them. Early computers did not have enough memory even to store large full matrices, let alone the parallelism to multiply them efficiently. So people turned to sparse matrices and learned how to invert them using iterative methods. In finance, PDE texts tell you to take a payoff, store it as a vector, and do backward induction with weekly time intervals using implicit methods. LU factorizations require either double or mixed precision, so Intel produced processors that run faster in double than in single: single-precision floating-point numbers have to be cast to double in microcode, and that takes time. Even with the MKL, single-precision matrix multiplication executes about twice as slowly as double-precision multiplication. In a sense, there was historically a trade-off between memory and floating-point precision. Since memory was scarce, double precision became the standard; it was needed to support sparse-matrix methods with their inherent instabilities.

GPUs come to us from the graphics market and are optimized for ray-tracing algorithms, which are just fine with 32-bit floats. Transistors are arranged in a totally different way than in traditional CPUs. GPU cards also have plenty of memory, so one can multiply large full matrices: there is no need to use sparsity patterns, no need to take elementary time steps beyond the stability threshold for explicit methods (which is around a few hours), and no need to program in low-level languages like C/C++ and push the optimization. If the bottleneck is on the GPU, the difference between running a program through the debugger and running an optimized version is a marginal 10%.
All one needs to do is manage the CPU-GPU bus wisely, as that becomes the bottleneck. The best way to tell whether an algorithm is stable in single precision is to set yourself up so that you can choose the floating-point unit (CPU or GPU) and the floating-point precision (32 or 64 bit) dynamically, and try it out. If one uses fast exponentiation, the error on implied swaption vols is less than one basis point in vol for strikes from 2% below the forward to 15% above the forward, and for all tenors/maturities up to 30Y into 20Y. Speed-up factors for single precision are around 11-15. It takes 7.62 seconds to price 585 swaptions (including obtaining all the discount factors, etc.). Perhaps more surprisingly, it takes 11.34 seconds to price 30,240 swaptions: scaling is highly non-linear because of the massive parallelism in the card. This is with a lattice of 672 sites and a time step of one hour. On the theoretical side, I proved some theorems in the paper I mentioned earlier. As a rule, if sensitivities are smooth, single precision is just fine. If not, single precision doesn't work, and when it doesn't, the failure typically has quite disastrous and visible effects.
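[Editor's note: a hedged MATLAB sketch of the Dongarra-style mixed-precision iterative refinement mentioned above, for a dense system A*x = b: the O(n^3) factorization and the triangular solves run in single precision, only the O(n^2) residual is computed in double. Variable names are illustrative.]

```matlab
As = single(A);
[Lf, Uf, p] = lu(As, 'vector');          % LU with pivoting, done in single precision
x = double(Uf \ (Lf \ single(b(p))));    % initial solve in single, promoted to double
for it = 1:10
    r = b - A*x;                         % residual computed in double precision
    if norm(r) <= 1e-12*norm(b), break; end
    dx = Uf \ (Lf \ single(r(p)));       % correction solved in single
    x  = x + double(dx);                 % refinement step
end
```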
 
Cuchulainn
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Geman & Roncoroni Jump Processes

October 14th, 2007, 1:05 pm

Quote: "Dongarra used iterative methods in mixed precision, not purely in single precision as I suggest. Dongarra's idea is to implement iterative linear algebra methods by using single precision for most steps in the iteration and double precision for the refinement. This works for LU factorizations."

Nice posts. Knowing when and when not to use single precision is a non-trivial task, IMHO. Iterative methods in SP probably work because they implement fixed-point theorems, so a bit of round-off won't hurt convergence; it just makes it a bit slower. But for LU I do not know why.

How can we get guaranteed error bounds in general? Are you using some interval arithmetic?

On GPUs: they are very powerful. One major issue - in contrast to OpenMP and MPI - is that the API (it seems?) is proprietary, which means non-portable and possibly non-future-proof code. Let's say one wanted to use C++ with CUDA; does that mean a complete rewrite, and how much effort would it be? To take an example, OpenMP code is non-intrusive and I can add pragmas at will.
 
albanese
Posts: 0
Joined: August 30th, 2004, 8:16 am

Geman & Roncoroni Jump Processes

October 16th, 2007, 12:38 pm

> How can we get guaranteed error bounds in general? Are you using some interval arithmetic?

Very good question. The answer is no, and that's why this is subtle. If you use interval arithmetic you arrive at worst-case error estimates, which unavoidably tell you that errors explode exponentially in the number of iterations no matter what algorithm you use. But this does not reflect empirical evidence. A way to look at the problem mathematically is to consider it as a case of homogenization theory. I am writing a math paper on this, but to describe it briefly, the answer depends on the numerical algorithm used. If one uses Pade approximants of some sort and time steps larger than the stability threshold for explicit methods [which is typically not much longer than one hour, or 1/(365*24) years], then errors in the high-frequency portion of the probability kernel grow rapidly with time. This can be mitigated by using double precision, sufficiently fine lattices, or brute-force splining and smoothing at each time step. Sensitivities are affected first because they are coupled to the high-frequency noise to a greater degree than the price functions themselves. If instead the time step is below the stability threshold of explicit methods [as for linear fast exponentiation, which is my favourite method], then an internal smoothing mechanism sets in automatically without the need for splining and smoothing, and errors are simply proportional to the floating-point precision noise. They do not explode with the number of iterations because positive and negative errors compensate each other well, self-averaging takes place, and high-frequency noise is damped by the internal smoothing mechanism itself. Math aside, you can easily verify this empirically; that's how I discovered it in the first place.

> The API (it seems?) is proprietary and this means non-portable and possibly non-future-proof code.

The crucial API you really need is good old BLAS. I heard NVIDIA wants to release their BLAS into the public domain; actually, there is already sgemm code on the web that performs marginally better than the one shipped with CUDA. IBM and the other manufacturers will always release a standard BLAS anyway, as they have always done. So one can safely count on having a BLAS interface optimized by the manufacturer for whatever hardware they come up with; it is the manufacturer's responsibility to give you that interface. What is more proprietary is the compiler used to write device-side code, if you choose to go that way. That is a bit shakier, although the CUDA C compiler is a great starting point. I myself use that kind of code sparingly, for simple things with a demonstrably high performance impact, when it is crucial to avoid moving data back and forth between the GPU and the CPU. If the code is well designed, the bus is the bottleneck, and device-side code can lessen the need to move data back and forth. There are other very effective techniques, like allocating memory on the host by taking it from the OS with a low-level call to a special malloc rather than the standard C malloc: direct memory access that bypasses the OS is around 5 times faster. The new PCI-E 2.0 standard will also help, as it runs at twice the clock frequency of the current PCI Express bus.

> Let's say one wanted to use C++ with CUDA; does it mean a complete rewrite and how much effort would it be?

It depends on how you structured your code and what algorithms you use.
If the algorithms you use are BLAS-centric and the bottleneck is in calls to dgemm/sgemm, then it's no problem at all. I use C# to import DLLs because I prefer working with high-level managed languages, but native C++ is even simpler as one doesn't have to marshal between native and managed code. The point I want to make is that CPU-side optimization is quite irrelevant if the code is well designed, and GPU-side optimization can also be taken for granted if one uses the manufacturer's interfaces. Optimization is more a design issue: managing two separate memory heaps, host side and device side, in such a way as to minimize use of the bus. That is the key factor. If instead you are porting code designed for cluster computing and MPI, then I'd suggest reconsidering from the ground up before you invest in a port. Algorithms that are optimal for cluster computing are not optimal on GPUs. Even numerical analysis textbooks should be taken with a grain of salt at this point in time; this technology is ahead of academia.
 
Cuchulainn
Posts: 23029
Joined: July 16th, 2004, 7:38 am

Geman & Roncoroni Jump Processes

October 28th, 2007, 7:20 am

Quote: "The crucial API you really need is good old BLAS. I heard NVIDIA wants to release their BLAS into the public domain; actually, there is already sgemm code on the web that performs marginally better than the one shipped with CUDA. IBM and the other manufacturers will always release a standard BLAS anyway, as they have always done. So one can safely count on having a BLAS interface optimized by the manufacturer for whatever hardware they come up with; it is the manufacturer's responsibility to give you that interface."

This is exactly what concerns me. When you look at the history of computing, the languages and tools that have stood the test of time are those that 1) are ISO/ANSI/IEEE standards (Cobol, Fortran, C++) and/or 2) have evolved from previous, immature technologies (for example, MPI, OpenMP).

Quote: "If instead you are porting code designed for cluster computing and MPI, then I'd suggest reconsidering from the ground up before you invest in a port. Algorithms that are optimal for cluster computing are not optimal on GPUs. Even numerical analysis textbooks should be taken with a grain of salt at this point in time; this technology is ahead of academia."

Interesting. Could you elaborate, please?

It will not be long until 8-core CPUs become affordable. These work with standard languages like C# and C++. Even now, 4-core applications are in use. The advantage is that apps can be ported incrementally.
 
vladr
Posts: 0
Joined: October 3rd, 2007, 7:37 pm

Geman & Roncoroni Jump Processes

October 28th, 2007, 8:03 pm

Quote, originally posted by Cuchulainn: "It will not be long until 8-core CPUs become affordable. These work with standard languages like C# and C++. Even now, 4-core applications are in use. The advantage is that apps can be ported incrementally."

It will not be long until 1024-core GPU configurations become affordable (quad SLI with 256 PEs per GPU); I would say we will see them by the end of next year. On mixed- and emulated-precision iterative linear system solvers on specialized hardware with limited FP accuracy (just SP, or even worse) there is a good paper Here. The authors have implemented mixed-precision CG and multigrid solvers on a GPU. The results are pretty impressive, taking into account that they tested their code on an old 7900 GTX: no CUDA yet and only 24 PEs per GPU.
 
albanese
Posts: 0
Joined: August 30th, 2004, 8:16 am

Geman & Roncoroni Jump Processes

October 29th, 2007, 8:53 pm

> This is exactly what concerns me. When you look at the history of computing, the languages and tools that have stood the test of time are those that 1) are ISO/ANSI/IEEE standards (Cobol, Fortran, C++) and/or 2) have evolved from previous, immature technologies (for example, MPI, OpenMP).

I would consider BLAS very much an established standard. You can also program GPUs with C-like languages using preprocessor directives for automatic allocation to a given number of threads and cores, without explicit low-level spawning; a bit like OpenMP, but more complex. However, I myself use BLAS as much as possible, and in particular sgemm to multiply matrices. Programming a good sgemm by hand is neither easy nor recommendable.

> Interesting. Could you elaborate, please?

There are three major differences between clusters and GPUs: (i) the amount of fast memory available (easily 1 GB on GPUs), (ii) the very fast interconnect speed on GPUs (since the cores are on the same chip), and (iii) the different design of the floating-point units [GPUs run single-precision arithmetic about 10 times faster than double precision, while CPUs run double-precision arithmetic about twice as fast as single precision since they have to cast internally]. All three considerations apply to both NVIDIA CUDA and the IBM Cell.

Memory availability per se has a huge impact on optimal algorithm design even on CPUs, although this is not broadly appreciated yet. In the days when memory was a scarce resource, people learned to exploit sparsity patterns to avoid storing big matrices. They also developed all sorts of implicit discretization schemes which are only weakly stable, not strongly stable. If one has more memory available, explicit methods become very competitive. Explicit methods require time steps of around one hour in calendar time, but linear fast exponentiation allows one to reach any time horizon in a logarithmic number of steps. The only real issue when using explicit methods is the need to store matrices and to have access to a good BLAS to multiply them. If you have over 1 GB of physical memory to allocate to an algorithm, then explicit methods are most definitely preferable to implicit methods because they are both faster and more stable. Stability shows up when you evaluate greeks: with a weakly stable implicit method, when you price a call option you see high-frequency wiggles created near the strike and then amplified as you move backward. If you use fast exponentiation, nothing like that happens. You can compute very reliably not only call option prices, but also the probability kernels themselves, and even a couple of derivatives of the probability kernels by discrete differentiation. The very fact of having kernels available has a ton of applications. The consideration above about memory also applies to double-precision CPU architectures.

The very fast interconnect speed of massively parallel GPUs means that matrix-matrix multiplication algorithms can be implemented very efficiently. It also means, for instance, that it is much faster to price a portfolio all at once than to price each instrument individually.
The non-linearities in execution time due to the fast interconnect are huge: if 500 swaptions take 7 seconds, 30,000 take 11 seconds when they are priced all together. Exploiting these non-linearities due to massive parallelism is another reason why design will be affected.

Finally, single precision. Since memory was limited, Intel and AMD optimized processors for double precision. This was meant to run implicit methods for sparse matrices, which are only weakly stable and would simply break down in single precision. GPUs were originally designed for the graphics market and optimized for single precision; they are only now being readapted to HPC applications. The interesting part of the story is that explicit methods are stable in single precision, as the strong stability property helps make them robust. Even kernels can be reliably computed in single precision using linear fast exponentiation, whereas they cannot be computed in double precision using implicit methods with time steps much longer than the stability threshold for explicit methods [i.e. about an hour]. The conclusion is that since there is sufficient on-board memory and explicit methods are viable and actually favoured, one can also take advantage of the factor-10 speed-up from the optimization for single-precision arithmetic. This story is not in the textbooks yet, but my bet is that it will be. The best way to convince yourself is just to try it out.
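[Editor's note: a hedged MATLAB sketch of pricing and greeks taken straight from the kernel, as described above. P is the single-precision transition kernel from the fast-exponentiation sketch earlier in the thread, S is the vector of lattice sites and K a strike; these names are assumptions for illustration.]

```matlab
payoff = max(S - K, 0);                  % call payoff on the lattice
V      = double(P)*payoff;               % one matrix-vector product per payoff
delta  = gradient(V, S);                 % discrete dV/dS along the lattice coordinates
gamma  = gradient(delta, S);             % discrete second derivative
% With the strongly stable explicit kernel these curves come out smooth; a
% weakly stable implicit scheme typically shows wiggles near the strike.
```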