Well I didn't bring any code with me on my laptop so that I can switch off on holidays, but since you posted some I couldn't resist trying it, that geek in me...
It's a 2 physical - 4 logical core CPU (i5-5200U), but the BIOS option to turn OFF HT is locked, so I can only run your test with HT ON.
It's Windows 10 and I compiled it with VS2015. I only tried your test up to the 4 available hardware threads, repeated 32 times as your code has it, and picked the min, so here you go:
NoThreads Time in microsecs
Comparing the results for 1 and 4, we see that the HT gain is 29%, pretty much as expected. Not sure why you get 10%, but we must find out! Is it Linux vs Windows? Is it that dual CPU setup? Is it the compiler? Again, I suspect the dual CPU thing, but it may be something else.
Comparing the results for 2 and 4 (as that test of yours from years ago did) would make us believe the HT gain is 39%, which it isn't. Using 2 of the 4 available threads seems to not do as well as using two physical cores with one thread each (that is, it doesn't halve the time of 1), as I had found before on my desktop too. Moreover, that timing for 2 threads is by far the most volatile of the four in those 32 retries.