Is anyone aware of any optimized implementations of the exp, sqrt and log functions? The compiler-supplied implementations are designed for precision, not speed. If one doesn't need 15-digit precision, there should be a way to speed up those math functions. Any thoughts?

That might be just what I need. Thanks. Btw, it is single precision, not double. The good news is that each function does 4 floats per call. MS compiler intrinsics are all double precision. (Yes/No?) I presume that all intrinsics benefit from SSE2. (Yes/No?)

List of suspects: Intel Compiler's Math Library (ICML), Julien Pommier's SSE math library (SSE_Math), MS intrinsic math functions (MS).
Platform: Visual Studio 2008, WinXP box running on a Core 2 Duo.
Math functions tested: exp, log, sin, cos.
Remarks: ICML is a part of Intel's compiler suite. ICML automatically replaces cmath functions with optimized intrinsics. MS automatically replaces stock cmath.
And the verdict is...
Normalized single precision performance: MS (1), SSE_Math (4), ICML (6).
ICML single precision performance is 6 times faster than MS and 1.5 times faster than SSE_Math.
But... ICML must be used in a loop. The loop must be auto-vectorized by the compiler. Not all loops can be auto-vectorized. Using std::vector instead of a plain array will prevent vectorization. Aliased pointers will prevent vectorization. Many other things can prevent vectorization. Intel Compiler's manual provides some info on how to facilitate vectorization.
SSE_Math has full single precision accuracy. SSE_Math can compute sin and cos of the same argument in 1 pass with almost zero overhead. It can be 8 times faster than MS in some circumstances.

Last edited by renorm on October 4th, 2010, 10:00 pm, edited 1 time in total.
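To make the vectorization remarks above concrete, here is a minimal sketch of the kind of loop an auto-vectorizer tends to handle: plain pointers, unit stride, no early exit, and a promise that input and output do not overlap. The function name exp_array is hypothetical, and __restrict is a compiler extension rather than standard C++. Whether a particular compiler actually vectorizes this loop depends on the compiler and its flags, so treat it as an illustration, not a guarantee.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper, not taken from any of the libraries discussed above.
// __restrict is a compiler extension (MSVC, GCC, Clang and the Intel compiler
// accept it); it promises that `in` and `out` do not overlap.
void exp_array(const float* __restrict in, float* __restrict out, std::size_t n)
{
    assert(in != nullptr && out != nullptr);
    // Plain counted loop, unit stride, no early exit, one math call per element.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::exp(in[i]);  // float overload of std::exp from <cmath>
    }
}
```

If the data lives in a std::vector<float> v, passing &v[0] (or v.data()) hands the loop the same plain pointers.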

4% error is too much. ICML and MKL are free to try for 1 month.

What are your "specifications"? You kept it a bit vague (intentionally?).
For 'fast', one could always glance at what the gaming/visual guys suggest;
for 'compact', at what the engineers around embedded systems use;
for 'correct' (but poor), at the 'ancients' from before IEEE standardization;
for correct and not poor, something like Moshier's Cephes or Sun's approach.

I just coded a LUT for exp and it's ~10x faster than the stock version. Precision can be tuned; in my case it's at least 5 digits.
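The table code itself isn't posted, so the following is only a sketch of one common way to build such a thing, not the poster's implementation. It assumes the identity exp(x) = 2^n * 2^(j/N) * 2^rem with y = x*log2(e), n = floor(y), a table of 2^(j/N), and a first-order correction for the small remainder. The names lut_exp and kTableSize are hypothetical. With 256 entries the analytic error of the correction is about 4e-6 relative, which is in the "5 digits" ballpark for moderate arguments; accuracy degrades for large |x| because the reduction is done in float. No speed claim is made for this sketch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of a table-driven exp; names and table size are assumptions.
namespace lut_detail {
const int kTableSize = 256;  // more entries -> smaller error, bigger table

struct Exp2Table {
    float v[kTableSize];
    Exp2Table() {
        // v[j] = 2^(j / kTableSize), computed once in double precision.
        for (int j = 0; j < kTableSize; ++j) {
            v[j] = static_cast<float>(std::pow(2.0, static_cast<double>(j) / kTableSize));
        }
    }
};
}  // namespace lut_detail

float lut_exp(float x)
{
    using namespace lut_detail;
    static const Exp2Table table;      // built on first call
    assert(x > -80.0f && x < 80.0f);   // keep the result inside float range

    const float y = x * 1.44269504f;   // y = x * log2(e), so exp(x) = 2^y
    const float n = std::floor(y);     // integer part of the exponent
    const float f = y - n;             // fractional part, in [0, 1]
    int j = static_cast<int>(f * kTableSize);
    if (j >= kTableSize) j = kTableSize - 1;  // guard: f can round up to 1.0
    const float rem = f - static_cast<float>(j) / kTableSize;  // in [0, 1/kTableSize]

    // 2^rem = exp(rem * ln 2), approximated to first order by 1 + rem * ln 2.
    const float frac = table.v[j] * (1.0f + rem * 0.693147181f);
    return std::ldexp(frac, static_cast<int>(n));  // scale by 2^n
}
```

Precision is tuned through kTableSize: the first-order error shrinks with the square of the table spacing, so doubling the table roughly quarters it.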

Can you post the actual speed?
Here are my measurements for single precision exp.
CPU: 2.33 GHz Core 2 Duo (single-threaded test).
Intel Math Library speed: 291M per sec
SSE_Math speed: 187M per sec
An array of 2^20 numbers, (1 : 2^20)*1e-5, was used as the input (range ~ (0, 10.5)). The output was stored in another array of length 2^20 and then compared with stock exp. The relative error didn't exceed 2^-22. Everything was cycled 100 times and the total time measured. Using a smaller/larger array with more/fewer cycles doesn't make any noticeable difference. Cache issues are less important in number crunching.

Last edited by renorm on October 5th, 2010, 10:00 pm, edited 1 time in total.
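For anyone who wants to post comparable numbers, the measurement procedure described above can be sketched roughly as follows. This is not the code used for the figures in the thread. The names ExpFn, BenchResult, make_input, bench_exp and stock_exp are all hypothetical, and the harness calls the routine through a function pointer one element at a time, so it suits scalar routines only; a 4-floats-per-call or auto-vectorized routine would need its loop timed directly. An aggressive optimizer may also shorten the repeated passes, so results deserve a sanity check.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

typedef float (*ExpFn)(float);  // hypothetical signature for a routine under test

struct BenchResult {
    double max_rel_error;  // worst relative error against double-precision std::exp
    double evals_per_sec;  // evaluations per second over all cycles
};

// Wrapper so the overloaded std::exp can be passed as an ExpFn.
float stock_exp(float x) { return std::exp(x); }

// Input as described in the post: (1 : n) * 1e-5.
std::vector<float> make_input(std::size_t n)
{
    std::vector<float> in(n);
    for (std::size_t i = 0; i < n; ++i) {
        in[i] = static_cast<float>(i + 1) * 1e-5f;
    }
    return in;
}

BenchResult bench_exp(ExpFn fn, const std::vector<float>& in, int cycles)
{
    assert(fn != nullptr && !in.empty() && cycles > 0);
    std::vector<float> out(in.size());

    // Time `cycles` full passes over the input.
    const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int c = 0; c < cycles; ++c) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = fn(in[i]);
        }
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - t0;

    // Compare the stored output with the reference, outside the timed region.
    BenchResult r = {0.0, 0.0};
    for (std::size_t i = 0; i < in.size(); ++i) {
        const double ref = std::exp(static_cast<double>(in[i]));
        const double err = std::fabs(static_cast<double>(out[i]) - ref) / ref;
        if (err > r.max_rel_error) r.max_rel_error = err;
    }
    if (elapsed.count() > 0.0) {
        r.evals_per_sec = static_cast<double>(cycles) * in.size() / elapsed.count();
    }
    return r;
}
```

Calling bench_exp(stock_exp, make_input(1u << 20), 100) mirrors the array size and cycle count quoted in the post; the 2^-22 bound mentioned there would be checked against max_rel_error.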