outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

I'm currently trying to zoom in on a raw bound for the impact of HT (hyper-threading), and I'm trying to keep the noise from the experiment design as low as possible.

Here are my results. I launch all threads at the same time, they all do 1 job, and I have a join at the end. (This is not ideal; I should have many more jobs than threads, N >> T.)

* The left is HT OFF, the right ON
* top row is the raw data, I do each experiment 32 times
* middle row is the best out of the 32
* in the bottom row I switch from microseconds to "relative to the duration of a single thread with HT ON"

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

Hardware: dual intel Xeon E5620 2.40GHz

katastrofa
Posts: 9665
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

### Re: Looking for hardware recommendations

Sorry for cutting in again. A naive question 1): isn't hyper-threading an inferior option to two proper cores? Does it affect in any way single-core operation performance? 2) For a quad-channel controller, is there any difference between 8 memory sticks 16 GB each and 4 memory sticks 32GB each (the option I've never seen tested so I decided for the first one) ?

Anyway, I've ordered Intel i9-7900X (14nm Skylake-X), because it significantly outscores Ryzen in practically all benchmarks relevant to me, Asus X299 TUF Mark 1, 8 x 16GB Corsair DDR4 3200MHz RAM, Samsung 960 Pro 512GB SSD (transfer 32GB/s - it's on PCIe bus; with hardware encryption); Corsair PSU... and Oculus Rift
Last edited by katastrofa on August 2nd, 2017, 6:17 pm, edited 1 time in total.

Billy7
Posts: 282
Joined: March 30th, 2016, 2:12 pm

### Re: Looking for hardware recommendations

I'm not confident I understand these results, but I'm at the beach right now and the ambient conditions are outside my brain's normal operating limits. So... does this indicate a mere 11% gain from using HT in this experiment? That is, a timing of 1.8 when using all 16 virtual threads, vs. the expected 2 with HT OFF, i.e. effectively 11% extra virtual cores? As for the fact that up to 8 threads your results seem to contrast with my finding (here, using 8 threads gives the same timing whether HT is ON or OFF, rather than the 10% difference I found), couldn't this be because you actually have 2 CPUs here? So maybe the full core resources of each CPU are used up until 8 threads (4+4 physical cores)? After that, say for 12 threads, I can see HT indeed performing worse than no HT (1.8 vs. around 1.6 for no HT?), which would confirm my findings w.r.t. this behavior?

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

Oh no, you need to leave the beach right away and go to a dark dungeon full of servers! We can't have this because of envy

I'm also not at home right now, but I agree with you. The 12 threads bit is maybe an artifact because I have each thread do a large computation. The 9-16 section also seems to have two plateaus, maybe because of the dual-CPU thing? I'll have a better benchmark in a minute where worker threads consume a large queue of jobs. Like a webserver. That should smooth things out.

What puzzles me is the low speedup of HT compared to the previous experiment. In this one I just produce a lot of random numbers using a cryptographic function which doesn't contain any IF statement (no branch prediction issues) and sum them up. Another thing that is different is that I take the "best out of 30 runs". I did that because I saw a clear lower bound, but what if the MT affects the distribution? Perhaps the mean is a better metric?

Billy7
Posts: 282
Joined: March 30th, 2016, 2:12 pm

### Re: Looking for hardware recommendations

Haha, thanks outrun, the thought of a dark dungeon full of servers (or say a large trading floor with artificial light!) is exactly what I needed in order to appreciate the beach even more and stop moaning about the fact that it's a little crowded today and not entirely perfect

I've never done any experiment with a dual-CPU machine, but I suspect it affects the findings and is responsible for the plateaus and possibly also for the apparent low HT speedup. I'm also surprised by that low HT speedup; I expected it to be 25-30% (but not more, unless Linux's default thread scheduling works better for HT than Windows', but I doubt that). I think the most unbiased experiment would be with each thread doing a large number of jobs, as you're saying, and ideally also with the second CPU turned off.
Also, I don't think that the threading software should affect things substantially, but who knows. I use OpenMP.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

Argh, I now see no difference between enabling and disabling HT! It looks like it stays off!

This is portable C++11 code without any dependencies. Is there a bug??

    // (C) sitmo 2017
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    std::mutex mutex;
    long remaining_jobs;
    long total_samples;
    double total_sum; // double: the sum of 64-bit samples would overflow a long

    const long samples_per_job = 100000;

    void job()
    {
        // keep running until there are no more jobs left
        while (true) {

            long job_id;

            // the job_id is the "remaining job counter"; we decrement it when claiming a job
            {
                std::lock_guard<std::mutex> guard(mutex);

                if (remaining_jobs <= 0)
                    break; // no more jobs
                job_id = remaining_jobs--;
            }

            // initialize random engine
            std::mt19937_64 eng(job_id);

            // compute the sum of 100k random samples
            double local_sum = 0;
            for (long i = 0; i < samples_per_job; ++i)
                local_sum += eng();

            // add the computed sum to the global sum
            {
                std::lock_guard<std::mutex> guard(mutex);
                total_sum += local_sum;
                total_samples += samples_per_job;
            }
        }
    }

    // NOTE: the function header, the thread launch/join lines and the call in main()
    // were missing from the post as captured; they are reconstructed here and the
    // name run() is a guess.
    void run(int nr_threads)
    {
        // reset global shared variables
        total_samples = 0;
        total_sum = 0;
        remaining_jobs = 1000;

        // start the timer
        auto start = std::chrono::high_resolution_clock::now();

        // launch the worker threads
        std::vector<std::thread> threads;
        for (int i = 0; i < nr_threads; ++i)
            threads.emplace_back(job);

        // wait for all threads to finish
        for (auto& t : threads)
            t.join();

        // stop the timer and report duration in microseconds
        auto end = std::chrono::high_resolution_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << nr_threads << "," << us << std::endl;
    }

    int main()
    {
        // 32 repeats of a sweep over 1..30 threads
        for (auto r = 0; r < 32; ++r) {
            for (auto t = 1; t < 31; ++t) {
                run(t);
            }
        }
    }

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

ah.. it was a mistake in the plotting sheet..

Here are the new results, a 10% increase with HT..

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Looking for hardware recommendations

Interesting results. I wonder if HT performance is sensitive to the degree that the threads are CPU-bound or I/O-bound? If the CPU is working flat-out and all the data and code sit in registers and cache, then there's little opportunity for the CPU to give resources to a hyper-thread. But if the threads depend on reading from DRAM or more distant devices, then HT would have substantial opportunities.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

I think HT performance is so poor because each thread does the exact same thing, in my code mainly integer arithmetic:
...
For example, a CPU core may have several "modules" (execution units) for doing basic integer (whole number) maths and logic, several for doing more advanced maths, several more for loading and storing data from/to memory and so on.
...
It is possible to optimise for hyperthreading if performance is a real issue (high-performance computing, for example) by advising the OS scheduler that certain tasks should be run on certain virtual cores to ensure that, for example, of the two virtual threads on one core, one is mostly doing integer maths and the other is mostly doing floating-point maths.
Last edited by outrun on August 2nd, 2017, 9:04 pm, edited 1 time in total.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

> Interesting results. I wonder if HT performance is sensitive to the degree that the threads are CPU-bound or I/O-bound? If the CPU is working flat-out and all the data and code sit in registers and cache, then there's little opportunity for the CPU to give resources to a hyper-thread. But if the threads depend on reading from DRAM or more distant devices, then HT would have substantial opportunities.

That would hold for regular threads too (one being idle, releasing resources for the other).

We just cross-posted; I was just trying to figure out how HT works.

As I understand it: each core has many execution units for different things, and it seems HT does a better job at putting them all to work.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

> Sorry for cutting in again. A naive question 1): isn't hyper-threading an inferior option to two proper cores? Does it affect in any way single-core operation performance? 2) For a quad-channel controller, is there any difference between 8 memory sticks 16 GB each and 4 memory sticks 32GB each (the option I've never seen tested so I decided for the first one) ?
>
> Anyway, I've ordered Intel i9-7900X (14nm Skylake-X), because it significantly outscores Ryzen in practically all benchmarks relevant to me, Asus X299 TUF Mark 1, 8 x 16GB Corsair DDR4 3200MHz RAM, Samsung 960 Pro 512GB SSD (transfer 32GB/s - it's on PCIe bus; with hardware encryption); Corsair PSU... and Oculus Rift

1) Yes, inferior... but you get it for free. On my machine it doesn't affect single-core operation.
2) *Maybe* 8x16 has more bandwidth? But I doubt it.
Congratulations!!

edit: is that "hardware encryption" the AES instructions? It can be used for fast random number generation.

"GCC 4.6+ and Clang 3.2+ provide intrinsic functions for RdRand when -mrdrnd is specified in the flags"

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Looking for hardware recommendations

My understanding is that HT processors have duplication of the stored state elements such as the registers. Thus, two threads can be resident in the CPU's registers and other core state variables inside a single core. The processor can dynamically swap between threads without I/O simply by shifting whether the computational elements (e.g., the ALUs) are linked to the #1 or #2 thread registers and variables. It's a lot faster than swapping threads which requires storing the old thread's register values and reading in the new thread's values.

outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

### Re: Looking for hardware recommendations

That makes sense.
A normal core will also try to break code into fragments that can execute in parallel on different ALUs, so it's mainly a register I/O thing?

Posts: 23951
Joined: September 20th, 2002, 8:30 pm

### Re: Looking for hardware recommendations

Yes, mainly register I/O but also the program counter, flags, control register, and other state variables. It substantially reduces the time required by a context switch because the core has two contexts preloaded.