```
Mtest[a_, b_, z_, trials_] :=
  Timing[Do[M[a, b, z], {trials - 1}];
   M[a, b, z]]
```

Thanks again for your experiments. I just tested with the Mathematica code above. The result gives the timing (in seconds) for doing the computation 'trials' times, followed by the value of M.

Taking 10,000 trials:

Mtest[1.,4.,50 I,10000]

{1.185608,0.00241259 +0.0599983 I}

Mtest[3.,10.,30+100. I,10000]

{1.747211,-13811.6-953.439 I}

Mtest[15.,20.,200. I,10000]

{1.326008,-2.82866*10^-6-3.27839*10^-6 I}

Mtest[4.,1.,50. I,10000]

{0.046800,-9044.44-18975.1 I}

Mtest[10.,3.,30+100. I,10000]

{0.078001,-1.8468*10^20+9.77876*10^21 I}

Mtest[4000.,4200.,50. I,10000]

{0.249602,-0.868055-0.468886 I}

So, roughly, the results are 0.05 - 1.75 secs to do 10,000 evals. To make a port to C/C++ (or Fortran, or whatever) worth the effort, I would need a factor of 10x improvement on these times or better, as well as a similar speedup for the Bessel [$]I_{\nu}(z)[$] function. (Also, [$]M(a,b,z)[$] needs to admit complex 'a' and 'z', and [$]I_{\nu}(z)[$] needs complex z.) I know you haven't paid attention to speed, but that's where I am on the ballpark requirements.

I did some experiments on C++ performance for the CHF using the series solution based on (3.2b).
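For reference, the kind of series evaluator I mean can be sketched like this (my own minimal version; the name `kummerM`, the term recurrence, and the `tol`/`maxTerms` cutoffs are my choices, not the exact code behind the timings):

```cpp
// Minimal sketch of a Kummer series evaluator for M(a,b,z), summing
//   M(a,b,z) = sum_n (a)_n / (b)_n * z^n / n!
// via the term recurrence term_{n+1} = term_n * (a+n)/(b+n) * z/(n+1).
// Complex 'a' and 'z' are allowed; the plain series is only practical
// for modest |z| (large |z| needs asymptotics or another method).
#include <complex>

std::complex<double> kummerM(std::complex<double> a, double b,
                             std::complex<double> z,
                             double tol = 1e-8, int maxTerms = 500)
{
    std::complex<double> term(1.0);   // n = 0 term: (a)_0/(b)_0 * z^0/0! = 1
    std::complex<double> sum = term;
    for (int n = 0; n < maxTerms; ++n) {
        term *= (a + double(n)) / (b + n) * z / double(n + 1);
        sum += term;
        if (std::abs(term) < tol * std::abs(sum))
            break;                    // relative-error stopping rule
    }
    return sum;
}
```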

In general, C++ is _much_ faster than MM (in fairness, MM has to be robust over a range of values; BTW, does it use GMP internally?).

I have three tests:

A. using double

B. using float (a bit faster than A)

C. A + OpenMP

For A, I ran many trials; here are the running times in seconds (N = 10 terms in the series, 1e-8 accuracy):

Evaluations: 10^5   10^6   10^7   10^8   10^9   5*10^9
Time (s):    0.03   0.47   4.22   34     334    -

Using a Core i5 (4 cores/4 threads, or 2 cores + 4 threads with hyperthreading?) I get:

Evaluations: 10^5   10^6   10^7   10^8   10^9   5*10^9
Time (s):    0.03   0.27   2.27   12.5   124    586

The speedup is noticeable: roughly 2-3x on the larger runs.
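For what it's worth, the OpenMP variant (C) can be sketched like this. This is a hedged, self-contained stand-in: `seriesM` and `meanAbsM` are my own names, the evaluator is a plain Taylor series with a 1e-8 relative tolerance, and the post's actual loop structure may differ.

```cpp
// Sketch of variant C (A + OpenMP): the trial loop is split across
// threads with a reduction, rather than parallelizing the series itself.
#include <cmath>
#include <complex>

// Plain Kummer series; reliable only for modest |z| (the full algorithm
// would need a different method for large |z|, as noted in the text).
static std::complex<double> seriesM(double a, double b, std::complex<double> z)
{
    std::complex<double> term(1.0), sum(1.0);
    for (int n = 0; n < 200; ++n) {
        term *= (a + n) / (b + n) * z / double(n + 1);
        sum += term;
        if (std::abs(term) < 1e-8 * std::abs(sum))
            break;
    }
    return sum;
}

// Average |M| over `trials` evaluations, accumulated so the compiler
// cannot elide the work; the pragma is a no-op without -fopenmp.
double meanAbsM(long trials, double a, double b, std::complex<double> z)
{
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (long i = 0; i < trials; ++i)
        acc += std::abs(seriesM(a, b, z));
    return acc / double(trials);
}
```

Compile with `g++ -O2 -fopenmp`; without `-fopenmp` the pragma is ignored and the loop runs sequentially, which recovers variant A.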

Of course, an algorithm for all values of z is needed, but I cannot see how a GPU offers a better solution. The problem has to be 'big' enough to warrant its use; it's like going to the supermarket to do the shopping in a Ferrari. I parallelized the (3.2b) series itself with N = 10 terms and it was much slower than the sequential code.

For trials = 10,000 I get a running time of ~0.05 s rather consistently.

My own plan is to get a machine with 10 cores plus an onboard double-precision(!) GPU in the short term. I'll write a business plan and send it to Santa. I'm hoping they don't pull the plug when the powers that be see the electricity bill!!!