Take a very simple MC experiment: you have N jobs (N large) and T worker threads (T small). Each job is something like "do 1,000 Monte Carlo draws from some distribution and report back the average".
If the efficiency is 100%, and a single thread does 1 job in 1 sec, then all jobs are expected to be processed in ceil[N/T] seconds, right? However, with C cores, will this be ceil[N/C] if T is a multiple of C?
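To make the arithmetic concrete, here is a minimal C++ sketch of that estimate. It rests on a simplifying assumption that is not from the post: jobs run in whole "rounds", and at most min(T, C) of them make progress at the same time. The name `ideal_seconds` and its parameters are hypothetical, and a real scheduler with time slicing will not follow this model exactly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Idealised wall-clock time (seconds) for `jobs` equal-cost jobs.
// Assumptions (illustrative only): 100% efficiency, no scheduling overhead,
// and at most min(threads, cores) jobs running at any one time.
std::int64_t ideal_seconds(std::int64_t jobs, std::int64_t threads,
                           std::int64_t cores, std::int64_t secs_per_job = 1) {
    assert(jobs >= 0 && threads > 0 && cores > 0 && secs_per_job > 0);
    const std::int64_t parallel = std::min(threads, cores);
    // Integer ceiling of jobs / parallel.
    const std::int64_t rounds = (jobs + parallel - 1) / parallel;
    return rounds * secs_per_job;
}
```

Under this model, `ideal_seconds(N, T, C)` equals ceil[N/T] whenever T <= C, and ceil[N/C] once T exceeds C, so extra threads beyond the core count buy nothing for CPU-bound jobs.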
(I'm working on it now! I'll have plots in 10 min I hope)
I suppose the tasks for 1 job look something like this:
1. Path Evolve (many paths)
2. RNG (single or multiple threads)
3. Path assembler
4. Pricer
Threads are one option; what about C++ futures, which leave the scheduling and load balancing to the runtime?
(IMO the OS scheduler knows how to schedule better than hand-written developer code.)
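For what it's worth, the futures idea could be sketched roughly as below, with one `std::async` call per job. The names `run_job` and `run_all`, the standard-normal distribution, and the seeding scheme are all assumptions made for illustration. Note that the C++ standard only says `std::launch::async` runs the task as if on a new thread; it does not promise a thread pool or load balancing, so how well this scales for large N depends on the implementation and the OS.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <random>
#include <vector>

// One job: the average of `draws` standard-normal samples.
// Each job owns its generator, so no RNG state is shared between threads.
double run_job(unsigned seed, std::size_t draws) {
    assert(draws > 0);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> dist(0.0, 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < draws; ++i) sum += dist(rng);
    return sum / static_cast<double>(draws);
}

// Launch one future per job and collect the averages in job order.
std::vector<double> run_all(std::size_t jobs, std::size_t draws) {
    std::vector<std::future<double>> pending;
    pending.reserve(jobs);
    for (std::size_t j = 0; j < jobs; ++j) {
        // Hypothetical seeding: job index + 1, just to give distinct streams.
        pending.push_back(std::async(std::launch::async, run_job,
                                     static_cast<unsigned>(j + 1), draws));
    }
    std::vector<double> means;
    means.reserve(jobs);
    for (auto& f : pending) means.push_back(f.get());  // blocks until ready
    return means;
}
```

Calling `run_all(N, 1000)` returns N averages. For a really large N you would probably want to batch several jobs per future rather than launch N tasks at once, since each task may end up as its own thread.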