Parallel RNG and distributed MC

Posted: January 29th, 2012, 8:01 pm
by AVt
Ok (though I do not use Excel, I interface code to it).

Parallel RNG and distributed MC

Posted: February 3rd, 2012, 8:04 pm
by Polter
outrun, have you played with C++ AMP yet? It looks quite interesting, http://www.danielmoth.com/Blog/C-AMP-Op ... ation.aspx

Parallel RNG and distributed MC

Posted: February 3rd, 2012, 8:20 pm
by Polter
Originally posted by outrun:
    Originally posted by Polter: outrun, have you played with C++ AMP yet? It looks quite interesting, http://www.danielmoth.com/Blog/C-AMP-Op ... ion.aspx
    No, haven't looked at that. Have you?

Not yet -- a bit wary to install the VC++11 preview on a live machine, and no time to set up a virtual infrastructure just for that. I'm thinking of reserving time for it when the Visual C++ 11 release comes out. So far I've been reading the docs and following conference materials; it looks quite promising (much more C++-friendly than CUDA, perhaps more so than Thrust, too). Apparently some of the array concepts are closely connected to recent additions to high-performance Fortran, according to Herb Sutter: http://channel9.msdn.com/Events/BUILD/B ... 663397

The main premise seems to be the same platform for CPU and GPU, although from what I'm seeing in the specs right now the CPU part seems to be SIMD vectorization features rather than multicore. It is possible to make use of both, though: http://www.danielmoth.com/Blog/Running- ... U.aspx

More info: http://www.gregcons.com/KateBlog/DidYou ... /TOOL-802T

Parallel RNG and distributed MC

Posted: February 3rd, 2012, 8:47 pm
by Polter
Originally posted by outrun (replying to the C++ AMP exchange quoted in the previous post):
    SIMD is fine, right? I think it's a bit tricky to decide whether it's worth the effort. Can't say... I've read that the Fermi GPUs support C++ code (instead of C -- something to do with memory integration), and I'll have to see whether it takes more than 1-2 days to learn CUDA, and how easy or difficult it is to write code that runs on both. I prefer simplicity over elegance.

Sure, it might be hard to tell right now. For me the portability aspect of the open spec looks interesting, with cross-platform heterogeneous computing as an explicit goal:

"Abstracts accelerators
 - Current version supports DirectX 11 GPUs
 - Could support others like FPGAs, off-site cloud computing...
Support heterogeneous mix of accelerators
 - Example: C++ AMP can use both an NVidia and AMD GPU in your system at the same time"
// http://www.nuonsoft.com/blog/2012/01/23 ...

"Microsoft supports and encourages anyone to implement the C++ AMP open specification on any platform, and we are actively working with interested parties already. If you are a compiler, hardware, or operating system vendor who is interested in C++ AMP support for your platform, read the spec and feel free to get in touch."
// http://blogs.msdn.com/b/nativeconcurren ... hed.aspx

So whether it's interesting from my point of view depends on whether the stated goals are successfully achieved -- any platform (any hardware, any OS), any compiler, any vendor, all in STL-style modern C++. That would indeed bring something new to the table, sufficiently better to warrant switching from CUDA. I'll keep track of this; when the official release of VC++11 comes out and I get my hands on it (or earlier, if I become impatient and risk going with the Developer Preview), I think I'll give it a shot and report back!

Parallel RNG and distributed MC

Posted: February 3rd, 2012, 8:49 pm
by Polter
Originally posted by outrun:
    I'm looking at the Boost normal random number generator code now -- I want to use the same interface for MC sample generators... It looks very slow! It does have a cache: it creates two rnd's at a time and hands them out one by one. However, check out the log, sqrt, sin, cos in the code below. Those are very expensive. The ziggurat method is much faster. I'm going to implement and benchmark the different methods.

BTW, there's an implementation of the ziggurat algo in QL; perhaps it can be of help: http://quantlib.sourcearchive.com/docum ... 92527.html

Parallel RNG and distributed MC

Posted: February 3rd, 2012, 8:55 pm
by Polter
At the same time, the ziggurat algo does some extra branching, which kills performance on a GPU [0] -- so using BM (as Boost does) on the GPU (Chapter 37 of GPU Gems 3, linked below, uses Tausworthe plus Box-Muller), while using ziggurat (as QL does) on the CPU, might be the optimal way to go.

[0] Quote:
    The fastest method in software is the ziggurat method (Marsaglia 2000). This procedure uses a complicated method to split the Gaussian probability density function into axis-aligned rectangles, and it is designed to minimize the average cost of generating a sample. However, this means that for 2 percent of generated numbers, a more complicated route using further uniform samples and function calls must be made. In software this is acceptable, but in a GPU, the performance of the slow route will apply to all threads in a warp, even if only one thread uses the route. If the warp size is 32 and the probability of taking the slow route is 2 percent, then the probability of any warp taking the slow route is 1 - (1 - 0.02)^32, which is 47 percent! So, because of thread batching, the assumptions designed into the ziggurat are violated, and the performance advantage is destroyed.
Source: http://http.developer.nvidia.com/GPUGem ... h37.html
// yes, there's really "http." in the hostname, this is not a typo ;-)

Parallel RNG and distributed MC

Posted: February 3rd, 2012, 10:03 pm
by Polter
Originally posted by outrun:
    This might be nice too: Wallace algorithm, vectorizable, ...

Yeah, it (the Wallace, 1996 one) is discussed in the GPU Gems link I posted just before :-), together with some trade-offs compared to B-M when used on GPUs. Still, perhaps the 2010 article you've encountered offers some new insights on the implementation?