Page **10** of **11**

### Optimized math functions

Posted: **October 20th, 2010, 9:13 am**

by **AVt**

I will have to look at it this evening, can not do it here though I am notsure it will compile in VC2005)The small relative error between the SSE version and mine may indicate thatthe difference is just at the last bits.The factor 2 'towards the end' may have a reason around outrun's suspicion.If IIRC: Ln(2)/2 = L0 + L1 and L0 holds the leading 8 bits, L1 the rest andsubstraction is done in 2 steps to reduce cancellation errors. If the compilerwould 'optimize' that away into 1 step (for speed) that would give errors.The critical point is for k ~ 21.5/ (Ln(2)/2) ~ 21.5/0.35 ~ 60 which is closeto that 8 leading bits ...The other thought (your post Tue Oct 19, 10 07:40 PM): not sure whether thecheck 1 - EXP/expf is correct w.r.t. behaviour of the compiler (as you saidit may use double exp). I do that a bit differently: would you mind to givean additional column in your error listing holding directly the results foryou _mm_myEXP_ps(x) ?

### Optimized math functions

Posted: **October 20th, 2010, 12:38 pm**

by **renorm**

Errors with Visual Studio and GCC. I did convert all values into doubles before computing relative error.

### Optimized math functions

Posted: **October 20th, 2010, 12:53 pm**

by **AVt**

Thx.The error jumping to 2 seems to be a sign problem, in that points _mm_myEXP_ps(x)is negative (but of plausible magnitude)

### Optimized math functions

Posted: **October 20th, 2010, 2:00 pm**

by **renorm**

It is shift operation (2^(k/2) = 1 << (k/2)). Left shit size shouldn't exceed 30. Negative region error is due to the taking of reciprocal exp(-x) = 1/exp(x). Other than that all looks good. Computing the relative error in double precision did fix it.

### Optimized math functions

Posted: **October 20th, 2010, 6:15 pm**

by **AVt**

I took your last error report for further testing, the MSVC version.1) myExp results in relative errors of at most 1 EPS for all inputs,EPS:= 2.0^(-23) for single precision2) The SSE result seriously suffer from using reciprocals for negativeinputs (relative errors ~ 0.001 and worse). 3) For positives the rel. error of SSE is below 1, except 2 cases: x = 15.52299976 and x = 16.16799927, where exp should be5515097.000 and 10511698.00, while your report tells 5515096.000 and 10511696.004) That made repeat 3) for the gcc report: the same (!) behaviourfor those two inputs.Silently I just corrected the sign problem for 3+4 (using abs).Sketching the test: x and f(x) are converted to IEEE singles, whichare rational numbers (=fractions). From that with Maple a referencevalue exp( IEEE(x) ) was computed and normed to IEEE single. Thusthe relative error 1 - IEEE( f(x) ) / IEEE( exp( IEEE(x) ) ) iswhat one can take as a fair value. Or in words: set the input to its IEEE fraction, compute result inhigh precision and then round it to the unique(!) nearest IEEE fraction.After that continue to compute the (relative) error.I am appending only 1 test report: msvc, myExp=SSE version and 0 <= x,new columns: x, myExp, exp(x), absolute error exp(x) - myExp(x), relative error 1 - myExp(x)/exp(x) , all as singles in IEEE 745----For the error through reciprocals in SSE I have no explanation.----What I still find puzzling, that it is slower than cephes.

### Optimized math functions

Posted: **October 25th, 2010, 8:18 pm**

by **renorm**

It works as advertised. The difference between SSE and plain implementation is about 1 epsilon for positive argument values. SSE is broken in the negative region, because it uses EXP(-x)=EXP(x). SSE implementation is broken when x>20 as well.

### Optimized math functions

Posted: **October 26th, 2010, 6:48 pm**

by **AVt**

Yes.I can not get rid of the 1 epsilon and 'sniffing' into the cited Intel sketch it may need toartifically increase precision (especially for the central polynomial, the 1 eps also occurs,if one uses expf there). I think the reason is k*float(ln(2)) versus float(k*ln(2)) and now asmall additive error becomes a small relative one, since exp turns addition to multiplication.For your shifting instead of ldexp: splitting k = k/2 + rest works for me. But for negative kinstead of 1/exp(x) one may have to grope bit presentations - not sure, whether it is worththe effort (if I understood you correctly).Below a more simple version with a relative error ~ 1 eps (no, have no really deep testings).For 'floor' one could use the following FLOOR for float ---> nearest int ---> floor__forceinline int FloatToInt_SSE(const float x) {return _mm_cvt_ss2si( _mm_load_ss(&x) );}__forceinline int FLOOR(const float d) {return (FloatToInt_SSE(d+0.5f)+FloatToInt_SSE(d-0.5f)) >> 1;}Only if all that gives an actual improvement in speed I would try to improve on exp0 as well.But I have doubts that HW supported implementations for floats can be beaten.PS: by enforcing |t| <= ln(2)/2 (but branching) the number of rel errors drop considerably.

### Optimized math functions

Posted: **October 27th, 2010, 6:43 pm**

by **renorm**

We could give it a rest. How about double precision functions?

### Optimized math functions

Posted: **October 27th, 2010, 7:01 pm**

by **AVt**

QuoteWe could give it a rest: agreed.For dbl precision more or less the same applies: shrink to some smallranges by reduction where one can approximate; if not being very small(using advanced methods) one needs rational functions instead of thepolynomials (and one may stumble into branches). But likely it will notbe faster than HW supported/optimized solutions (except you have aspecial situation in mind, which is (not yet) compiler supported). Andif 'correct' rounding is desired (i.e. rel error < 0.5 eps) , than extendingprecision by SW is need (just my 0.02 ? on that).

### Optimized math functions

Posted: **October 29th, 2010, 10:48 am**

by **Cuchulainn**

QuoteOriginally posted by: outrun..and how about 4x exp in parallel? The name "Intel vector math library" would suggest functionality like this. Something like this would be very handy for me. I need to evaluate loads of exp(x)'s and I can imagine that this of parallelism would be optimal.It should also be easily portable to GPU's etc with non-four type of parallelism void exp4(float x[4], float result[4]){..}The main question: what kind of algo structureDo you want a pipeline, fork&join or some kind of Divide and Conquer? And maybe see @renorm's remarks on Intel's TBB.

### Optimized math functions

Posted: **October 29th, 2010, 1:15 pm**

by **renorm**

What outrun wants is called vectorization.I am using Simple SSE math functions and it is working quite well. About 4 times faster math function computation on Visual Studio. GCC can auto-vectorize. But can it vectorize math functions? Vectorization requires signatures similar to this:__m128 exp4(__m128);__m128 is basically an aligned version of float[4]. GCC doesn't have anything built-in like that, at least according to Google.Thread based parallelism is useful only if very large number of math functions can be computed interdependently.

### Optimized math functions

Posted: **October 29th, 2010, 4:41 pm**

by **Cuchulainn**

QuoteOriginally posted by: outrunYes I'm going to use threads as well, because I have 8 cores in my machine that I want to utilize. I'll look up the gcc equivalent, thats my platformSo, loop parallelism? This will give almost linear speedup, if there are no dependencies.