December 27th, 2009, 1:01 pm
Quote
Originally posted by: kevin08
What is incremental approach?

One particular scenario:

// Some Guidelines on Parallelising C++ Code (v 0.1)

When developing multi-threaded applications, one option is to rewrite single-threaded code to take advantage of the available multi-core and multi-processor hardware. To this end, we decide on a multi-threaded version of the original sequential code. The *goals* of this migration project are:

- Improved speedup: the program should run faster on a CPU with N processors than on a CPU with one processor. Linear speedup on a CPU with N processors means that the program runs N times faster than the same program on a single-processor CPU. A program whose serial fraction is greater than 10% has a maximal speedup of 10 in the limit as the number N of processors tends to infinity!

- Serial equivalence: does the multi-threaded program give the same results and output as the (original) single-threaded program? This can be a hard requirement, for example in option pricing code.

How do we proceed, and how long should the porting process take? First of all, we need to decide whether it is *feasible* to port our sequential code. There are some issues to address before we embark on this project:

- The current application code may be so badly documented and/or designed that there is no point trying to port it. It would take too long to understand what the intent is, in which case it is better to design the problem again using parallel design techniques and threading code.

- What is the program's serial fraction? This is defined as the fraction of the program's execution time taken up by the parts that must execute serially. Knowing the serial fraction allows us to determine (or find an upper bound on) the speedup using the Amdahl and Gustafson-Barsis laws. You should use these laws as a back-of-envelope estimate for a project GO/NO-GO decision.
- Micro-optimisation: in some situations we may find ourselves trying to parallelise code that runs almost as fast in the single-threaded case. Multi-threaded code is suitable for compute-intensive algorithms and problems with large data sets. There is an application size threshold below which single-threaded code has better speedup than multi-threaded code!

Having determined that the effort of producing a multi-threaded solution is worth the returns, we need to define a process that will realise the dual goals of speedup and serial equivalence. We adopt an *incremental* approach: step-wise migration of sequential code to multi-threaded code. There are no golden rules here, but the following guidelines and checks may be useful:

- Create a back-of-envelope data dependency graph. Do you know what the main tasks in the program are and their dependencies?
- What are the main algorithms in the program? What are the glue data structures that connect these algorithms?
- Which patterns are being used, for example dataflow, object-oriented (hierarchical), event-driven?

Answering these questions will give an indication of where to begin with code migration. Some issues are:

1) Many technical applications contain for-loops. These are ideal candidates for parallelisation (in principle). The caveats are:
a) Silent data corruption: shared data being accessed by multiple threads leads to non-deterministic behaviour. Data needs to be protected.
b) Cache problems: the data in a for-loop may not fit into cache. Consider loop optimisation methods, for example loop fusion, fission and unrolling.
c) Avoid critical sections in loops; use reduction variables instead.

2) Non-thread-safe code and libraries: for example, a Monte Carlo option pricer that employs non-thread-safe random-number generators will give different results in the single-threaded and multi-threaded cases when different threads are used to generate these numbers.
It is possible to place such code in a critical section, but this has disastrous performance implications in double for-loops. A possible solution is some form of loop fission, or using a dataflow pattern such as Producer-Consumer.

3) Round-off errors and reduction variables: when parallelising a reduction on a floating-point type, the multi-threaded result may not be the same as the single-threaded one, because the order of accumulation changes.

4) Last but not least, knowledge of the underlying hardware (L1, L2 and L3 caches, pinning threads to processors, and so on) is essential if good speedup is wanted.
Last edited by Cuchulainn on December 26th, 2009, 11:00 pm, edited 1 time in total.