Yes, I don't think it would be the most portable RNG in the world (it violates points 1 and 5 above).
It is a strange mix of numerical analysis (Lebesgue) and statistics (rejection). Two 'correlated' convergence measures are being used. You can decrease the number of samples needed by taking more vertical strips, but then you have to fit them in memory.
And the 3-4 static arrays will probably play havoc with the L1 cache.
I suppose the algo implicitly assumes a 32-bit single processor (the method is very old). It works in practice, but how does it work in theory, and on all the other platforms?
A search threw up this:
The fastest method in software is the ziggurat method (Marsaglia 2000). This procedure uses a complicated method to split the Gaussian probability density function into axis-aligned rectangles, and it is designed to minimize the average cost of generating a sample. However, this means that for 2 percent of generated numbers, a more complicated route using further uniform samples and function calls must be taken. In software this is acceptable, but on a GPU, the performance of the slow route will apply to all threads in a warp, even if only one thread uses the route. If the warp size is 32 and the probability of taking the slow route is 2 percent, then the probability of any warp taking the slow route is 1 - (1 - 0.02)^32, which is about 47 percent! So, because of thread batching, the assumptions designed into the ziggurat are violated, and the performance advantage is destroyed.
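As a quick sanity check on the quoted figure, the per-warp probability can be computed directly (a minimal sketch; the slow-route probability of 2 percent and warp size of 32 are the numbers from the quote):

```python
# Warp-divergence arithmetic from the quoted passage: a warp is slowed
# down if ANY of its threads takes the ziggurat's slow route.
p_slow = 0.02  # probability a single sample needs the slow route
warp = 32      # threads per warp, all executing in lockstep

# P(at least one thread slow) = 1 - P(all threads fast)
p_warp_slow = 1 - (1 - p_slow) ** warp
print(f"{p_warp_slow:.1%}")  # prints 47.6%
```

So the 2 percent per-sample cost inflates to nearly half of all warps, which is the point the quote is making about thread batching.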