
### Looking for hardware recommendations

Posted: **October 13th, 2016, 6:12 pm**

by **Alan**

I am starting to search for a higher-performance desktop than my current system. Mainly I need an Intel 8-core system because I am doing some Mathematica work that needs the performance boost. (Mathematica's standard license will launch up to 8 kernels if you have 8 or more cores. I have just discovered how easy it is to invoke all the cores at once with my particular Mathematica app, but I only have 4 at the moment.)

Currently I have a Dell XPS desktop. Shopping at Dell, I learned their only 8-core system is the Alienware Area-51 R2. The rep says he expects 8 cores to migrate to their XPS systems pretty soon, but didn't have an announced launch date. I am not buying until January, so I may wait. The Alienware seems somewhat pricey for what you get, but I would certainly like to hear any experiences with that model.

Any recommendations/experiences with *any* manufacturer's Intel 8-core (or more) systems? If you have one, what is your cooling system?

### Re: Looking for hardware recommendations

Posted: **October 13th, 2016, 6:39 pm**

by **outrun**

Intel has a Xeon CPU for high end machines, e.g. this one has 20 cores:

http://ark.intel.com/products/93812/Int ... e-1_80-GHz
I have a Supermicro server with *dual* CPUs; that's another way to scale things up (have two CPUs instead of one). Servers are typically like pizza boxes, but Supermicro also has desktops:

https://www.supermicro.nl/products/syst ... 038A-i.cfm with a dual CPU motherboard

The hardware is very reliable; they have about $2bn in annual revenue.

### Re: Looking for hardware recommendations

Posted: **October 13th, 2016, 7:01 pm**

by **Alan**

Thanks -- I'll check it out. I see the clock speed is only 1.8 GHz. I wonder if this can compete with that Alienware I mentioned, which uses the i7-5960X (and which they say can be overclocked up to 4 GHz). I am assuming Mathematica is invoking 8 cores on a single CPU, but I will have to check how hard it is to invoke 2 CPUs.

BTW, do you happen to have Mathematica on your machine?

### Re: Looking for hardware recommendations

Posted: **October 13th, 2016, 7:52 pm**

by **outrun**

I think that's memory clock speed?

Did a quick search, and here is a review of the motherboard where they run it with 2x Intel Xeon E5-2690 v3 at 2.60 GHz, but you can overclock them to 3.50 GHz.

Each CPU has 12 cores and 24 threads. They do some tricks (hyper-threading) to present each physical core as two logical cores, called threads. It scales quite well.

I've posted this before: this plot shows the performance of my server (2x Intel Core i3-2100), which has 8 cores and 16 threads, running an MC job (I think it was *your* distributed-computing experiment a long, long time ago!). You can see that the first 8 parallel jobs scale linearly; the next 8 start using the threads, which scale less well, but still substantially.

My machine runs Linux and I unfortunately don't have Mathematica.

I expect that both machines, Alienware or Supermicro, will be very similar? Maybe Alienware is more retail-oriented (and they have wicked cases ;D) and Supermicro more corporate. Servers in general tend to be more conservative, putting reliability over performance.
Another thing to consider is adding a GPU, or at least having the option to add one later (slots, power supply). I bought two GeForce GTX 970s for a couple of hundred euro and some jobs now run 20x faster!

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 12:18 am**

by **Alan**

Thanks again -- lots of good ideas there.

I am still learning how to scale my Mathematica app efficiently (it is a maximum likelihood optimization). When I invoke 4 slave Mathematica kernels, I get about a 3x speedup. After my original post above, I learned I could invoke 8 slave kernels. Trying that, my total CPU usage increased from about 50% of maximum to about 96% -- unfortunately the run-time only decreased by a further 25%. So, one thing to work on while shopping around is to see if I can get better scaling -- something closer to your graph up to where the CPU is maxed out. There is probably a lot of inter-process communication going on in my case that Mathematica manages automatically, and this kills the hoped-for linearity.
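For what it's worth, a quick Amdahl's-law sanity check on these numbers (sketched in Python, just the arithmetic) is consistent with a sizeable serial/communication fraction:

```python
# Back-of-envelope Amdahl's-law check on the observed scaling.
# Amdahl: speedup(N) = 1 / ((1 - p) + p / N), where p is the parallel fraction.

def parallel_fraction(speedup, n):
    """Solve Amdahl's law for p, given a measured speedup on n kernels."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)

def amdahl_speedup(p, n):
    """Ideal speedup on n kernels if fraction p of the work parallelises."""
    return 1.0 / ((1.0 - p) + p / n)

p = parallel_fraction(3.0, 4)       # 3x speedup on 4 kernels -> p ~ 8/9
predicted_8 = amdahl_speedup(p, 8)  # ideal prediction for 8 kernels
observed_8 = 3.0 / 0.75             # a further 25% run-time cut on top of 3x

print(p, predicted_8, observed_8)
```

Even the ideal prediction for 8 kernels is only about 4.5x given the 3x-at-4-kernels figure, close to the roughly 4x observed, so better scaling has to come from shrinking the serial/communication part.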

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 1:07 am**

by **Traden4Alpha**

> Thanks again -- lots of good ideas there.
>
> I am still learning how to scale my Mathematica app efficiently (it is a maximum likelihood optimization). When I invoke 4 slave Mathematica kernels, I get about a 3x speedup. After my original post above, I learned I could invoke 8 slave kernels. Trying that, my total CPU usage increased from about 50% of maximum to about 96% -- unfortunately the run-time only decreased by a further 25%. So, one thing to work on while shopping around is to see if I can get better scaling -- something closer to your graph up to where the CPU is maxed out. There is probably a lot of inter-process communication going on in my case that Mathematica manages automatically, and this kills the hoped-for linearity.

Is there a chance you are hitting memory or memory bandwidth issues? Unless the code is really tight and can fit in cache (which is another hardware spec to look for), all the cores will be sharing the same bus to RAM.

Perhaps you can find a machine with wider/faster memory architecture.

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 1:13 am**

by **Traden4Alpha**

P.S.

CPU Benchmarks might help you pick the right machine.

Good luck!

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 8:11 am**

by **outrun**

1. Tweak the parallelisation in Mathematica

Maybe you can present the problem differently to Mathematica so that it can better break it up into parallel chunks? I bet the ML will involve some costly gradient calculation for each element in a large set of elements (options, returns, etc.), then sum things up to get a gradient step?

One thing you can try is to break up the gradient computation explicitly into N subset gradients (using N different variables?), allocate each manually to a kernel, and then, when the parallel computation is done, merge them yourself. Dive deeper into the parallel optimisation possibilities within Mathematica.
2. Re-implement for speed

If you're looking for speed for this specific problem -- because it's needed in production, or very often in general -- rather than for one-off experiments you prefer to keep doing in Mathematica, then you could switch away from Mathematica to e.g. Python or C/C++. You might even be able to run it on a GPU.

Amazon's new P2 GPU instances (https://aws.amazon.com/ec2/instance-types/p2/) are amazing machines. The rent is only $0.90/hr/GPU. If you were able to utilise the GPU power (which is tricky and depends on the problem, but there are lots of high-level tools to help with that) then you can get 1 teraflop per GPU (a trillion floating-point operations per second). To compare: a high-end dual Xeon E5 utilising all its cores does 0.25 teraflop. My 2-year-old MacBook Air does 0.02 teraflop. The drawback is that it'll cost quite a bit of time to develop.
3. Improve the search algorithm

If you just want shorter run time while still using Mathematica, then there might also be additional options on top of parallel (or not) execution. In my experience, focussing on order of convergence with better algorithms can give a much bigger speedup than the constant factor you can get from utilising your cores (a better initial guess is also something with potentially high impact). In Deep Learning the learning phase is also an optimisation problem (cost minimisation). I use TensorFlow, which has some high-level convenience functions to speed up the optimisation: it supports GPUs out of the box, it does automatic differentiation (for the gradient), and it has various gradient descent algorithms to pick from. Some are much faster than others.

Most of the time people use "Adam", which is a momentum-based gradient descent method.
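For reference, Adam itself is only a few lines; a toy sketch on a 1-D quadratic (illustrative learning rate and step count, nothing like the real likelihood problem):

```python
import math

# Minimal Adam update loop, minimising f(x) = (x - 3)^2 whose gradient
# is 2*(x - 3). Defaults beta1/beta2/eps are the usual Adam values;
# lr and steps are picked for this toy problem.
def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias corrections
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)  # should end up near the minimum at x = 3
```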

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 2:49 pm**

by **Alan**

Thanks -- that's a good link. Re memory, I did have a serious memory leak yesterday -- now fixed. Memory bandwidth might be an issue; I assume if I move to a new 8+ core system, I will reap the advantage of newer motherboards, etc.

@outrun,

Thanks for the further suggestions. Yesterday was my first parallelization run in Mathematica, so there is lots to explore in improving the calculation. It is a maximum likelihood optimization where a key time sink is simply calculating the likelihood. But that is just a sum of terms, with run-time proportional to the length of the data. So I distribute the terms of the sum over the cores using a Mathematica built-in called ParallelSum[..]. It turned out to be easy to implement: maybe a couple of hours of reorganizing my code. It's still a very lengthy computation for me, so the various things on your list are all important alternatives.
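The structure is easy to sketch outside Mathematica too. Roughly (a Python stand-in, with a toy per-term function; threads stand in for the slave kernels, though in Python a real speedup would need processes because of the GIL):

```python
# Sketch: split a big sum (e.g. the terms of a log-likelihood) into
# explicit chunks, evaluate each chunk in its own worker, then merge --
# the same shape of computation that ParallelSum[..] manages internally.
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    # Partial sum over one chunk; x*x is a toy stand-in for the real
    # per-observation likelihood term.
    return sum(x * x for x in chunk)

def parallel_sum(data, n_chunks=4):
    size = (len(data) + n_chunks - 1) // n_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(chunk_sum, chunks))
    return sum(partials)  # the merge step

data = list(range(1000))
print(parallel_sum(data))
```

The merge step is cheap; the inter-worker communication cost is all in shipping the chunks out, which is why long chunks scale better than many short ones.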

Re a C/C++ version, I thought a little about that several days ago. The main bottleneck there is that I need two special functions with all-complex arguments: the Bessel I(nu,z) and the confluent hypergeometric M(a,b,z). Googling, I could only locate one library supporting those, and it was not really set up for Windows, my current environment. If anybody knows such a library that can easily be hooked into Visual Studio on Windows, I'd appreciate a link. (Some usual suspects, Boost and GSL, don't have the complex-argument support, as far as I could tell.)
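In a pinch, both functions have simple power series that take complex arguments directly; a quick pure-Python sketch (truncated series -- fine as a sanity check for moderate |z|, certainly not a production-grade implementation):

```python
import cmath
import math

def bessel_I(nu, z, terms=60):
    """I_nu(z) = sum_k (z/2)^(2k+nu) / (k! * Gamma(k+nu+1)),
    real order nu, complex argument z; truncated power series."""
    half = z / 2.0
    total = 0j
    for k in range(terms):
        total += half ** (2 * k + nu) / (math.factorial(k) * math.gamma(k + nu + 1))
    return total

def kummer_M(a, b, z, terms=80):
    """M(a,b,z) = sum_n (a)_n / (b)_n * z^n / n!, complex z,
    via the term recurrence term_{n+1} = term_n * (a+n) z / ((b+n)(n+1))."""
    term = 1.0 + 0j
    total = term
    for n in range(terms):
        term *= (a + n) * z / ((b + n) * (n + 1))
        total += term
    return total

# Sanity checks: I_0(1) ~ 1.26607, and M(1,1,z) = e^z exactly.
print(bessel_I(0, 1.0))
print(kummer_M(1.0, 1.0, 1 + 2j), cmath.exp(1 + 2j))
```

The usual catch with these series is overflow/cancellation once |z| gets large, which is where the specialised libraries earn their keep.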

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 3:43 pm**

by **outrun**

That sounds like you can get a parallel speedup within Mathematica. That Supermicro motherboard states that you can go up to 22 cores (or 44 threads), but I'm sure it won't be cheap.

The Boost Bessel functions don't support complex numbers. It looks like Python's SciPy does, but I doubt it will be faster than Mathematica.

I would either stick with Mathematica and a many-core machine, or, if that doesn't work and you need 100x more speed, go and find a C or C++ library and try to get it to run on a GPU.

### Re: Looking for hardware recommendations

Posted: **October 14th, 2016, 3:56 pm**

by **outrun**

Another idea: if it's a low-dimensional problem then you could build a lookup table with precomputed likelihood values. It will have a bit lower accuracy, but I don't think that matters much since you have a lot of terms that you sum. Each term calculation can use the same lookup table.
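A sketch of that idea (Python, with a hypothetical 1-D log-density standing in for the expensive per-term function):

```python
import math
from bisect import bisect_right

# Expensive per-term function (toy stand-in for the real likelihood term):
# here a standard normal log-density.
def log_term(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

# Precompute once on a grid covering the data range.
LO, HI, N = -5.0, 5.0, 2001
STEP = (HI - LO) / (N - 1)
GRID = [LO + i * STEP for i in range(N)]
TABLE = [log_term(x) for x in GRID]

def log_term_lut(x):
    """Linear interpolation into the precomputed table."""
    i = min(max(bisect_right(GRID, x) - 1, 0), N - 2)
    w = (x - GRID[i]) / STEP
    return (1.0 - w) * TABLE[i] + w * TABLE[i + 1]

# Every term in the likelihood sum reuses the same table.
data = [0.37, -1.2, 2.5]
print(sum(log_term_lut(x) for x in data))
```

With a grid step h, linear interpolation error is O(h^2), so a modest table already keeps the per-term error well below the noise in the summed likelihood.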

### Re: Looking for hardware recommendations

Posted: **October 15th, 2016, 11:37 am**

by **Cuchulainn**

> Re a C/C++ version, I thought a little about that several days ago. The main bottleneck there is that I need two special functions with all-complex arguments: the Bessel I(nu,z) and the confluent hypergeometric M(a,b,z). Googling, I could only locate one library supporting those, and it was not really set up for Windows, my current environment. If anybody knows such a library that can easily be hooked into Visual Studio on Windows, I'd appreciate a link. (Some usual suspects, Boost and GSL, don't have the complex-argument support, as far as I could tell.)

Yeah, typically a Fortran-era function that has not (yet) been ported to C++. It was another generation of mathematicians.

What about the following: use Kummer's series representation for the CHF and implement it using the machinery in Boost Math (I would assume that complex numbers are supported at that level, all things being equal). If not, it is not difficult to get an initial working model directly.

http://www.boost.org/doc/libs/1_54_0/li ... ation.html
Equation (12) looks doable IMO

http://www.ece.mtu.edu/faculty/wfp/arti ... l_math.pdf
And it is obvious that parallelism is needed (e.g. do you know the Visual Studio PPL library? Its parallel aggregate is very easy to use if you go with equation (12)), and possibly (Boost) multiprecision. Even easier is a one-liner in OpenMP 2.0 (shipped with Visual Studio) with a thread-safe reduction variable to parallelise the loop (for example, you can get a speedup of 5 on an 8-core machine for 2-factor option pricing with ADE/FDM). BTW, does your GPU support double precision and C99 maths?

Here is the OpenMP piece as a file...

In 2016 two of my MSc students used the C++ AMP library for GPU work, and it is much easier to use than the CUDA interface. If you decide to go down the GPU road I would advise C++ AMP in the short term, unless there are compelling reasons for not doing so. Based on my contacts, it takes 3-4 months to learn CUDA. The learning curve with C++ AMP is much shorter.

I have public-domain theses for both CUDA and C++ AMP. Give me a shout if you are interested in copies.

Of course, parallel multicore + GPU on one machine also offers possibilities. Then load balancing is the name of the game.

At the end of the day, the peculiarities of the algorithm determine which design pattern and platform are most suitable.

### Re: Looking for hardware recommendations

Posted: **October 15th, 2016, 4:21 pm**

by **Alan**

Thanks -- it's good to learn about the existence of C++ AMP as an alternative to CUDA -- maybe not for this specific project, but for a follow-up one that I have (more Monte Carlo related).

I am loath to attempt to write these special functions myself. For the gurus who understand both Unix and Windows: how much trouble do you think it would be to get this math library compiled and then linkable to C/C++ code compiled under Visual Studio on Windows? (It seems to have many dependencies on Unix-type stuff.)

### Re: Looking for hardware recommendations

Posted: **October 15th, 2016, 5:04 pm**

by **outrun**

I think it will be involved. That lib requires ARB (arblib.org), and setting that up is only described for Linux -- which is somewhat surmountable -- but that in turn requires 3 other Linux libraries.

I think you should either implement those two functions yourself (see if you can find the algorithms in one of these libs, without any dependencies on other functions), or run a virtual Linux machine on your Windows machine and install everything there.

### Re: Looking for hardware recommendations

Posted: **October 15th, 2016, 5:15 pm**

by **outrun**

Implementing those two functions yourself might also make it easier to get them running on a GPU?

I also like that arblib; I still need to do that American put with arbitrary precision to analyze convergence speed, analyze stability, and compute the put value with 100 digits.