SERVING THE QUANTITATIVE FINANCE COMMUNITY

 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 1st, 2017, 8:53 pm

I'm currently trying to zoom in on a bound for the raw impact of HT, so I try to keep the noise in the design as low as possible.

Here are my results. I launch all threads at the same time, they each do 1 job, and I have a join at the end. (This is not ideal; I should have many more jobs than threads, N >> T.)

* The left is HT OFF, the right ON
* top row is the raw data, I do each experiment 32 times
* middle row is the best out of the 32
* in the bottom row I switch from microseconds to "relative to the duration of a single thread with HT ON"
[Image: grid of timing plots, HT OFF (left) and HT ON (right); rows as listed above]
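The measurement scheme described above (launch T threads together, one job each, join at the end, repeat, keep the best run) could be sketched roughly as follows. This is only an illustration, not the code behind the plots: `busy_work`, `time_once` and `best_of` are hypothetical names, and the workload is a stand-in integer recurrence.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in workload (an assumption, not the original job): a simple integer recurrence.
std::uint64_t busy_work(std::uint64_t seed, long iterations) {
    std::uint64_t x = seed;
    for (long i = 0; i < iterations; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

// Launch nr_threads threads at once, one job each, join them all,
// and return the elapsed wall-clock time in microseconds.
long long time_once(int nr_threads, long iterations) {
    assert(nr_threads > 0);
    std::vector<std::uint64_t> results(static_cast<std::size_t>(nr_threads));
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < nr_threads; ++i)
        threads.emplace_back([&results, i, iterations] {
            results[static_cast<std::size_t>(i)] =
                busy_work(static_cast<std::uint64_t>(i) + 1, iterations);
        });
    for (auto& t : threads)
        t.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

// Repeat the experiment and keep the fastest run ("best out of N").
long long best_of(int repeats, int nr_threads, long iterations) {
    assert(repeats > 0);
    long long best = time_once(nr_threads, iterations);
    for (int r = 1; r < repeats; ++r)
        best = std::min(best, time_once(nr_threads, iterations));
    return best;
}
```

Calling `best_of(32, t, iterations)` for each thread count `t` would give one point of the "best out of 32" row; dividing by a reference timing gives the relative scale of the bottom row.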
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 1st, 2017, 9:00 pm

Hardware: dual Intel Xeon E5620 2.40GHz

Compiler: Debian clang version 3.5.0-10, C++11 threads, linked to the pthread library.
 
katastrofa
Posts: 8129
Joined: August 16th, 2007, 5:36 am
Location: Alpha Centauri

Re: Looking for hardware recommendations

August 2nd, 2017, 12:20 pm

Sorry for cutting in again. Two naive questions: 1) Isn't hyper-threading an inferior option to two proper cores? Does it affect single-core performance in any way? 2) For a quad-channel controller, is there any difference between 8 memory sticks of 16GB each and 4 memory sticks of 32GB each? (I've never seen the latter option tested, so I decided on the first one.)

Anyway, I've ordered an Intel i9-7900X (14nm Skylake-X), because it significantly outscores Ryzen in practically all benchmarks relevant to me, an Asus X299 TUF Mark 1, 8 x 16GB Corsair DDR4 3200MHz RAM, a Samsung 960 Pro 512GB SSD (32Gb/s transfer, since it's on the PCIe bus; with hardware encryption), a Corsair PSU... and an Oculus Rift :-(
Last edited by katastrofa on August 2nd, 2017, 6:17 pm, edited 1 time in total.
 
Billy7
Posts: 282
Joined: March 30th, 2016, 2:12 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 12:30 pm

I'm not confident I understand these results, but I'm at the beach right now and the ambient conditions are outside my brain's normal operating limits. So... does this indicate a mere 11% gain from using HT in this experiment? That is, a timing of 1.8 when using all 16 virtual threads, versus the expected 2 with HT OFF, i.e. effectively 11% extra virtual cores? Up to 8 threads, your results seem to contrast with my finding: here, using 8 threads gives the same timing whether HT is ON or OFF, rather than the 10% difference I found. Couldn't this be because you actually have 2 CPUs here? Maybe the full core resources of each CPU are used up to 8 threads (4+4 physical cores)? After that, say for 12 threads, I can see HT indeed performing worse than no HT (1.8 vs. around 1.6 for no HT?), which would confirm my findings w.r.t. this behavior?
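The 11% figure is just the ratio of the two relative timings. As a quick check, assuming the values 2.0 and 1.8 as read off the plot (`gain_percent` is a hypothetical helper name):

```cpp
#include <cassert>

// Relative throughput gain, in percent, when a run takes t_new
// instead of the reference time t_ref for the same amount of work.
double gain_percent(double t_ref, double t_new) {
    assert(t_ref > 0.0 && t_new > 0.0);
    return (t_ref / t_new - 1.0) * 100.0;
}
```

With these inputs, `gain_percent(2.0, 1.8)` comes out at roughly 11.1, which matches the "11% extra virtual cores" reading.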
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 1:54 pm

Oh no, you need to leave the beach right away and go to a dark dungeon full of servers! We can't have this because of envy :-)

I'm also not at home right now, but I agree with you. The 12-thread bit is maybe an artifact because I have each thread do one large computation. The 9-16 section also seems to have two plateaus, maybe because of the dual-CPU thing? I'll have a better benchmark in a minute where worker threads consume a large queue of jobs, like a web server. That should smooth things out.

What puzzles me is the low speedup of HT compared to the previous experiment. In this one I just produce a lot of random numbers using a cryptographic function which doesn't contain any IF statement (no branch prediction issues) and sum them up. Another thing that is different is that I take the "best out of 30 runs". I did that because I saw a clear lower bound, but what if the MT affects the distribution? Perhaps the mean is a better metric?
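To compare the "best of" metric with the mean, the repeated timings for one thread count can be summarized in several ways at once. A minimal sketch, with `TimingSummary` and `summarize` as hypothetical names; the median is included as an extra, since it is less sensitive to outliers than the mean:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct TimingSummary {
    double best;    // fastest run (the lower bound)
    double mean;    // arithmetic mean of all runs
    double median;  // middle value, robust against a few slow runs
};

// Summarize repeated timings (e.g. microseconds) of the same experiment.
TimingSummary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double mean =
        std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(n);
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    return {samples.front(), mean, median};
}
```

If HT changes the shape of the timing distribution and not just its lower bound, the gap between `best` and `mean` should differ between the HT ON and HT OFF runs.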
 
Billy7
Posts: 282
Joined: March 30th, 2016, 2:12 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 3:22 pm

Haha, thanks outrun, the thought of a dark dungeon full of servers (or say a large trading floor with artificial light!) is exactly what I needed in order to appreciate the beach even more and stop moaning about the fact that it's a little crowded today and not entirely perfect :-)

I've never done any experiment with a dual-CPU machine, but I suspect it affects the findings and is responsible for the plateaus and possibly also for the apparent low HT speedup. I'm also surprised by that low HT speedup; I expected it to be 25-30% (but not more, unless Linux's default thread scheduling works better for HT than Windows', but I doubt that). I think the most unbiased experiment would be with each thread doing a large number of jobs, as you're saying, and ideally also with the second CPU turned off.
Also, I don't think that the threading software should affect things substantially, but who knows. I use OpenMP.
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 8:37 pm

Argh, I now see no difference in enabling / disabling HT! It looks like it stays off!

This is portable C++11 code without any dependencies, is there a bug??
//(C) sitmo 2017
#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
#include <chrono>
#include <random>


std::mutex mutex;
long remaining_jobs;
long total_samples;
double total_sum; // double: a sum of 64-bit random numbers does not fit in a long

void job()
{
    
    // keep running forever (unless there are no more jobs left)
    while (true) {

        long job_id;
        
        // the job_id is the "remaining job counter", we decrement that after starting a job.
        {
            std::lock_guard<std::mutex> guard( mutex );
            
            job_id = remaining_jobs;
            if (remaining_jobs > 0)
                remaining_jobs--;
            else
                break; // no more jobs
        }
        
        // initialize random engine
        std::mt19937_64 eng(job_id);

        // Compute the sum of 100k random samples
        double local_sum = 0;
        for (long i=0; i<100000; ++i)
            local_sum += eng();
    
        // Add the computed sum to the global sum
        {
            std::lock_guard<std::mutex> guard( mutex );
            total_sum += local_sum;
            total_samples += 100000;
        }
    }
}

void run_threads(int nr_threads) {

    // reset global shared variables
    total_samples = 0;
    total_sum = 0;
    remaining_jobs = 1000;
    
    // start the timer
    auto start = std::chrono::high_resolution_clock::now();
    
    // Launch worker thread pool
    std::vector<std::thread> threads;
    for(int i=0; i<nr_threads; ++i)
        threads.push_back( std::thread( job ) );

    // Wait for all threads to finish
    for(auto& thread : threads)
        thread.join();

    // stop the timer and report duration in microseconds
    auto end = std::chrono::high_resolution_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << nr_threads << "," << us  << std::endl;
        
}

int main()
{
    for (auto r=0; r<32; ++r) {
        for (auto t=1; t<31; ++t) {
            run_threads(t);
        }
    }
}
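One possible variation, which is not part of the benchmark above: the job counter could be handed out with a `std::atomic` instead of a mutex, so that lock contention stays out of the measurement. A minimal sketch under that assumption; `worker` and `run_jobs` are hypothetical names, and each job sums only 1000 samples to keep the example quick.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <thread>
#include <vector>

// Each worker claims job ids with fetch_sub until none are left,
// and accumulates into a private sum that is written out once at the end.
void worker(std::atomic<long>& remaining_jobs, std::atomic<long>& jobs_done, double& sum_out) {
    double local_sum = 0.0;
    while (true) {
        const long job_id = remaining_jobs.fetch_sub(1); // value before the decrement
        if (job_id <= 0)
            break; // no more jobs
        std::mt19937_64 eng(job_id);
        for (long i = 0; i < 1000; ++i)
            local_sum += static_cast<double>(eng());
        jobs_done.fetch_add(1);
    }
    sum_out = local_sum;
}

// Run nr_jobs jobs on nr_threads threads; returns the number of jobs completed.
long run_jobs(int nr_threads, long nr_jobs, double& total_sum) {
    assert(nr_threads > 0);
    std::atomic<long> remaining_jobs(nr_jobs);
    std::atomic<long> jobs_done(0);
    std::vector<double> sums(static_cast<std::size_t>(nr_threads), 0.0);
    std::vector<std::thread> threads;
    for (int i = 0; i < nr_threads; ++i)
        threads.emplace_back(worker, std::ref(remaining_jobs), std::ref(jobs_done),
                             std::ref(sums[static_cast<std::size_t>(i)]));
    for (auto& t : threads)
        t.join();
    total_sum = 0.0;
    for (double s : sums)
        total_sum += s;
    return jobs_done.load();
}
```

Because each thread only touches shared state when it claims a job, the timing should mostly reflect the random number generation itself.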
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 8:46 pm

ah.. it was a mistake in the plotting sheet..

Here are the new results, a 10% increase with HT..
[Image: plot of the new results]
 
Traden4Alpha
Posts: 23951
Joined: September 20th, 2002, 8:30 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 8:57 pm

Interesting results. I wonder if HT performance is sensitive to the degree that the threads are CPU-bound or I/O-bound? If the CPU is working flat-out and all the data and code sit in registers and cache, then there's little opportunity for the CPU to give resources to a hyper-thread. But if the threads depend on reading from DRAM or more distant devices, then HT would have substantial opportunities.
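One way to probe this would be to time two kinds of workload under HT ON and OFF: one whose state stays in registers, and one that keeps waiting on memory. A rough sketch follows; `compute_bound`, `make_cycle` and `memory_bound` are hypothetical names, and how large the table must be to fall out of cache depends on the machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Compute-bound kernel: the whole state is one 64-bit value (xorshift64 steps).
std::uint64_t compute_bound(std::uint64_t x, long steps) {
    for (long i = 0; i < steps; ++i) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
    }
    return x;
}

// Build a random permutation that forms a single cycle (Sattolo's algorithm),
// so that following next[] visits every slot before returning to the start.
std::vector<std::size_t> make_cycle(std::size_t n, std::uint64_t seed) {
    assert(n > 1);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 eng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(eng)]);
    }
    return next;
}

// Memory-bound kernel: pointer chasing, where each load depends on the previous one.
std::size_t memory_bound(const std::vector<std::size_t>& next, long steps) {
    std::size_t pos = 0;
    for (long i = 0; i < steps; ++i)
        pos = next[pos];
    return pos;
}
```

With a table much larger than the last-level cache, most steps of `memory_bound` should stall on DRAM, which is the situation where a second hardware thread would have the most idle resources to use.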
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 9:01 pm

I think HT performance is so poor because each thread does the exact same thing, in my code mainly integer arithmetic:
...
For example, a CPU core may have several "modules" (execution units) for doing basic integer (whole number) maths and logic, several for doing more advanced maths, several more for loading and storing data from/to memory and so on.
...
It is possible to optimise for hyperthreading if performance is a real issue (high-performance computing, for example) by advising the OS scheduler that certain tasks should be run on certain virtual cores to ensure that, for example, of the two virtual threads on one core, one is mostly doing integer maths and the other is mostly doing floating-point maths.
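On Linux, "advising the scheduler" usually means setting a thread's CPU affinity. The sketch below is Linux/glibc-specific and only an illustration: `first_allowed_cpu`, `pin_to_cpu` and `demo_pin` are hypothetical names, and on other platforms the functions do nothing. Which logical CPUs share a physical core is machine-specific; on Linux it can be read from `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#ifdef __linux__
#include <pthread.h>
#include <sched.h>
#endif

// First logical CPU the calling thread is currently allowed to run on, or -1.
int first_allowed_cpu() {
#ifdef __linux__
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(cpu_set_t), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
#else
    return 0; // assumption: no affinity API used on this platform
#endif
}

// Pin a running std::thread to one logical CPU.
// Returns 0 on success, an error number otherwise (always 0 off Linux).
int pin_to_cpu(std::thread& t, int cpu) {
    assert(cpu >= 0);
#ifdef __linux__
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
#else
    (void)t;
    return 0;
#endif
}

// Demo: start a thread, pin it while it is still running, then stop it.
int demo_pin() {
    std::atomic<bool> stop(false);
    std::thread t([&stop] {
        while (!stop.load())
            std::this_thread::yield();
    });
    const int cpu = first_allowed_cpu();
    const int rc = (cpu >= 0) ? pin_to_cpu(t, cpu) : -1;
    stop.store(true);
    t.join();
    return rc;
}
```

For the integer/floating-point pairing described in the quote, one would pin an integer-heavy thread and a floating-point-heavy thread to two sibling logical CPUs of the same core.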
Last edited by outrun on August 2nd, 2017, 9:04 pm, edited 1 time in total.
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 9:03 pm

Traden4Alpha wrote:
"Interesting results. I wonder if HT performance is sensitive to the degree that the threads are CPU-bound or I/O-bound? If the CPU is working flat-out and all the data and code sit in registers and cache, then there's little opportunity for the CPU to give resources to a hyper-thread. But if the threads depend on reading from DRAM or more distant devices, then HT would have substantial opportunities."

That would hold for regular threads too (one being idle, releasing resources for the other).

We just cross-posted, I was just trying to figure out how HT works :-)

As I understand it: each core has many execution units for different things, and it seems HT does a better job at putting them all to work.
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 9:16 pm

katastrofa wrote:
"Sorry for cutting in again. Two naive questions: 1) Isn't hyper-threading an inferior option to two proper cores? Does it affect single-core performance in any way? 2) For a quad-channel controller, is there any difference between 8 memory sticks of 16GB each and 4 memory sticks of 32GB each? (I've never seen the latter option tested, so I decided on the first one.)

Anyway, I've ordered an Intel i9-7900X (14nm Skylake-X), because it significantly outscores Ryzen in practically all benchmarks relevant to me, an Asus X299 TUF Mark 1, 8 x 16GB Corsair DDR4 3200MHz RAM, a Samsung 960 Pro 512GB SSD (32Gb/s transfer, since it's on the PCIe bus; with hardware encryption), a Corsair PSU... and an Oculus Rift :-("

1) Yes, inferior... but you get it for free. On my machine it doesn't affect single-core operation.
2) *Maybe* 8x16 has more bandwidth? But I doubt it.
Congratulations!!

edit: that "hardware encryption", is that the AES instructions? It can be used for fast random number generation.

"GCC 4.6+ and Clang 3.2+ provide intrinsic functions for RdRand when -mrdrnd is specified in the flags"
 
Traden4Alpha
Posts: 23951
Joined: September 20th, 2002, 8:30 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 9:34 pm

My understanding is that HT processors have duplication of the stored state elements such as the registers. Thus, two threads can be resident in the CPU's registers and other core state variables inside a single core. The processor can dynamically swap between threads without I/O simply by shifting whether the computational elements (e.g., the ALUs) are linked to the #1 or #2 thread registers and variables. It's a lot faster than swapping threads which requires storing the old thread's register values and reading in the new thread's values.
 
outrun
Posts: 4573
Joined: April 29th, 2016, 1:40 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 10:21 pm

That makes sense.
A normal core will also try to break code into fragments that can execute in parallel on different ALUs, so it's mainly a register I/O thing?
 
Traden4Alpha
Posts: 23951
Joined: September 20th, 2002, 8:30 pm

Re: Looking for hardware recommendations

August 2nd, 2017, 10:44 pm

Yes, mainly register I/O but also the program counter, flags, control register, and other state variables. It substantially reduces the time required by a context switch because the core has two contexts preloaded.