
 
User avatar
d32
Topic Author
Posts: 0
Joined: May 24th, 2006, 10:57 pm

Distributed Finite Difference

October 3rd, 2007, 8:14 am

Does current technology in numerical analysis allow finite difference algorithms to be split up and run on, say, a grid, like MC could? If so, can anyone provide references?
 
User avatar
Cuchulainn
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Distributed Finite Difference

October 3rd, 2007, 12:09 pm

Quote, originally posted by d32: "Does current technology in numerical analysis allow finite difference algorithms to be split up and run on, say, a grid, like MC could? If so, can anyone provide references?"

There's lots of stuff here. I have done some on OpenMP and MPI. Here is a useful book (here), e.g. the Helmholtz equation using OpenMP.

But the design and algorithms will be different from 'traditional' FDM, for example the Schwarz method. In fact, Schwarz laid the basis for parallel FDM more than 100 years ago (here).
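For flavour, a minimal sketch of what a shared-memory (OpenMP) explicit FD step looks like; this is illustrative only, not taken from the book linked above:

```cpp
// Minimal sketch: one explicit FD time step for u_t = u_xx on a 1D grid,
// parallelised with OpenMP (shared memory). Names and coefficients are
// illustrative only.
#include <vector>
#include <omp.h>

void explicit_step(const std::vector<double>& u, std::vector<double>& unew,
                   double lambda /* = dt/dx^2, <= 0.5 for stability */)
{
    const int N = static_cast<int>(u.size());
    // Each interior node is updated independently, so the loop parallelises trivially.
    #pragma omp parallel for
    for (int i = 1; i < N - 1; ++i)
        unew[i] = u[i] + lambda * (u[i-1] - 2.0 * u[i] + u[i+1]);
    // Boundary values are left to the caller (Dirichlet/Neumann handling).
}
```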
Last edited by Cuchulainn on October 3rd, 2007, 10:00 pm, edited 1 time in total.
 
User avatar
gradstudent12
Posts: 0
Joined: February 23rd, 2007, 1:36 pm

Distributed Finite Difference

October 8th, 2007, 2:16 pm

Of course. The most basic idea with MPI is domain decomposition.

You have a grid; you split the grid into N roughly equal partitions and give each processor one of these partitions of the domain. You have some (finite-difference) scheme which updates the quantities at each lattice site, and this scheme requires some information from neighboring lattice sites. Near the edges of each processor's local data, that processor will need data from an adjacent processor's memory to update the quantity at that site. You figure out what you need to do the updating, then have the adjacent processor send that data to this processor using MPI.

Optimizations, at every time step (a sketch follows below):
- First post non-blocking sends and receives for the boundary data needed from other processors.
- While the data transfer is happening, compute everything that does not depend on this data.
- Post an MPI wait, which will not return until the data has been received.
- Compute the quantities that depend on the boundary data.
- Go to the next time step.

Things to keep in mind: minimize the number of MPI sends and receives, and maximize the size of each send and receive.
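A bare-bones sketch of the halo exchange described above (1D decomposition, one ghost cell per side; the names, update formula and MPI tags are illustrative, not anyone's production code):

```cpp
// Sketch of one time step with overlapped communication/computation (MPI, 1D
// domain decomposition). u has 'nloc' owned points plus one ghost cell at each
// end: u[0] and u[nloc+1].
#include <mpi.h>
#include <vector>

void time_step(std::vector<double>& u, std::vector<double>& unew,
               double lambda, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int nloc  = static_cast<int>(u.size()) - 2;
    const int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    const int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // 1. Post non-blocking receives/sends for the boundary (ghost) data.
    MPI_Request req[4];
    MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&u[nloc + 1], 1, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, comm, &req[2]);
    MPI_Isend(&u[nloc],     1, MPI_DOUBLE, right, 0, comm, &req[3]);

    // 2. While the messages are in flight, update interior points that do not
    //    depend on ghost data.
    for (int i = 2; i <= nloc - 1; ++i)
        unew[i] = u[i] + lambda * (u[i-1] - 2.0 * u[i] + u[i+1]);

    // 3. Wait until the ghost cells have arrived.
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // 4. Update the two edge points that needed the neighbours' data.
    unew[1]    = u[1]    + lambda * (u[0]      - 2.0 * u[1]    + u[2]);
    unew[nloc] = u[nloc] + lambda * (u[nloc-1] - 2.0 * u[nloc] + u[nloc+1]);
}
```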
Last edited by gradstudent12 on October 7th, 2007, 10:00 pm, edited 1 time in total.
 
User avatar
Cuchulainn
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Distributed Finite Difference

October 8th, 2007, 3:57 pm

Quote, originally posted by gradstudent12 (domain decomposition with non-blocking MPI sends and receives, overlapping communication and computation, as above).

What kinds of speedup can you expect from this approach?
 
User avatar
CompPDE
Posts: 0
Joined: May 9th, 2006, 7:28 pm

Distributed Finite Difference

October 8th, 2007, 4:25 pm

It depends on what solver you are using as to what speedup you can attain. It is widely known that certain schemes are not "scalable": the benefit of parallelism stops after some number of processors n*, where the overhead of passing messages for neighbouring points slows the solve to a crawl. It is very straightforward to parallelize FD schemes as gradstudent12 mentioned (a toy model of the overhead effect is sketched below). There is some other, more advanced stuff you can do as well.
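A toy model of that saturation effect, with made-up constants: per-step time behaves like a compute term that shrinks as 1/p plus a communication term that grows with p, so speedup peaks at some finite processor count.

```cpp
// Toy scaling model: T(p) = W/p + c*p, i.e. compute work divided among p
// processors plus a communication/latency overhead that grows with p.
// The constants W and c are made up purely for illustration.
#include <cstdio>

int main()
{
    const double W = 1000.0;  // total compute work (arbitrary units)
    const double c = 0.5;     // per-processor communication overhead
    for (int p = 1; p <= 64; p *= 2)
    {
        double T = W / p + c * p;
        std::printf("p = %2d   T(p) = %8.2f   speedup = %6.2f\n", p, T, (W + c) / T);
    }
    return 0;  // speedup peaks near p = sqrt(W/c) and then degrades
}
```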
 
User avatar
Cuchulainn
Posts: 22926
Joined: July 16th, 2004, 7:38 am

Distributed Finite Difference

October 8th, 2007, 7:00 pm

Quote, originally posted by CompPDE (certain schemes are not scalable; the benefit of parallelism stops after n* processors, where message-passing overhead slows the solve to a crawl).

Again, it's the speedup that is important, and whether you use shared or distributed memory. With shared memory we get a speedup of about 1.5 on a dual-core machine; I must look up what we got for MPI. Either way, it's not embarrassingly parallel...

Quote: "overhead of passing messages of neighbouring points slows the solve to a crawl"

So use shared memory.
Last edited by Cuchulainn on October 7th, 2007, 10:00 pm, edited 1 time in total.
 
User avatar
gradstudent12
Posts: 0
Joined: February 23rd, 2007, 1:36 pm

Distributed Finite Difference

October 8th, 2007, 7:28 pm

I have an explicit scheme, so it is easy to make parallel. I get log(running time) = -0.94 * log(number of processors) from a fit over a series of runs of the same problem, varying only the number of processors used (up to 512 processors). My scalability is limited only by the ratio of the extra (ghost) rows needed to hold data from adjacent processors to the total number of local rows per processor.
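For reference, the fitted exponent translates directly into speedup and efficiency numbers; a quick sketch of the arithmetic (only the -0.94 comes from the fit above, the rest is just evaluation):

```cpp
// Evaluate the fitted scaling law: running time ~ p^(-0.94), so
// speedup(p) = p^0.94 and parallel efficiency = speedup(p) / p = p^(-0.06).
#include <cmath>
#include <cstdio>

int main()
{
    const double alpha = 0.94;  // fitted exponent from the log-log regression
    for (int p : {2, 8, 64, 512})
        std::printf("p = %3d   speedup = %7.1f   efficiency = %.2f\n",
                    p, std::pow(p, alpha), std::pow(p, alpha - 1.0));
    return 0;
}
```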
 
User avatar
Traden4Alpha
Posts: 3300
Joined: September 20th, 2002, 8:30 pm

Distributed Finite Difference

October 8th, 2007, 7:38 pm

In the mechanical engineering world, for FDM and FEM (finite element methods), the parallelizing approaches look to minimize the ratio of interface nodes between partitions to interior nodes within each partition. This reduces the amount of inter-partition messaging/data coordination relative to the work each processor does inside its own partition. They sometimes use energy-minimization ideas to morph the boundaries between partitions. In any case, the higher the dimensionality of the lattice, the worse things get, because the ratio of surface area to volume of the partitions gets much worse (a small illustration follows).
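To put rough numbers on that surface-to-volume point, a small sketch (with an illustrative partition size of 32 points per side) of the ghost-to-owned ratio for a cubic partition in d dimensions:

```cpp
// Ratio of ghost (interface) points to owned points for a d-dimensional cubic
// partition of side n with a one-cell-deep halo on every face. The halo has
// (n+2)^d - n^d points; the partition owns n^d. Side length n = 32 is
// illustrative only.
#include <cmath>
#include <cstdio>

int main()
{
    const double n = 32.0;
    for (int d = 1; d <= 4; ++d)
    {
        double owned = std::pow(n, d);
        double halo  = std::pow(n + 2.0, d) - owned;
        std::printf("d = %d   ghost/owned = %.3f\n", d, halo / owned);
    }
    return 0;  // the ratio grows roughly like 2d/n: more dimensions, more communication
}
```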
 
User avatar
albanese
Posts: 0
Joined: August 30th, 2004, 8:16 am

Distributed Finite Difference

October 14th, 2007, 11:48 am

Traditional parallelism is achieved with clusters/MPI, but the emerging new trend is to use massively parallel GPUs and manufacturer-provided BLAS. The two strategies are quite different. With clusters you try to use sparse matrices without storing them and apply or invert them on vectors; that is, you use sparse BLAS level 2. With GPUs, the best way is to use full BLAS level 3. The advantage is that you can implement jumps and all sorts of multi-factor models. Matrices are limited to dimension up to 1K-5K, which is more than you may even need in most cases. Another advantage is that there are big benefits in pricing a portfolio all at once as opposed to iterating over positions. Scaling in the position direction is highly non-linear.
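A sketch of the level-2 versus level-3 distinction (CBLAS names, illustrative dimensions; not a description of any particular production setup): stepping one value vector at a time is a matrix-vector product per instrument, whereas pricing the whole portfolio at once packs all value vectors into a matrix, so each step becomes a single matrix-matrix product.

```cpp
// Level-2 vs level-3 BLAS for one backward time step V <- A * V, where A is a
// dense (moderate-size) propagator matrix and each column of V is the value
// vector of one instrument. Header name varies by BLAS implementation.
#include <vector>
#include <algorithm>
#include <cblas.h>

void step_one_by_one(const std::vector<double>& A, std::vector<double>& V,
                     int n, int m)  // A is n x n, V is n x m (column-major)
{
    std::vector<double> tmp(n);
    for (int j = 0; j < m; ++j)     // level 2: one GEMV per instrument
    {
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A.data(), n,
                    &V[j * n], 1, 0.0, tmp.data(), 1);
        std::copy(tmp.begin(), tmp.end(), V.begin() + j * n);
    }
}

void step_portfolio(const std::vector<double>& A, const std::vector<double>& V,
                    std::vector<double>& Vnew, int n, int m)
{
    // level 3: one GEMM for the whole portfolio, much friendlier to GPU BLAS
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, m, n,
                1.0, A.data(), n, V.data(), n, 0.0, Vnew.data(), n);
}
```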
 
User avatar
CompPDE
Posts: 0
Joined: May 9th, 2006, 7:28 pm

Distributed Finite Difference

October 14th, 2007, 11:06 pm

I have never really looked at GPUs, but don't GPUs (a) use single precision, and (b) not conform to IEEE? So is it "safe" to use them for hardcore numerics? I could see them being used for other things, but computational maths? Personally, due to (a) and (b), I would not go near it, unless you know ways to use it safely?
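The single-precision worry is easy to illustrate with nothing GPU-specific at all, just naive float versus double accumulation:

```cpp
// Naive summation of ten million small increments in float vs double.
// Illustrates the kind of accumulated round-off that makes people nervous
// about single precision in long time-stepping loops.
#include <cstdio>

int main()
{
    const int N = 10000000;
    float  sf = 0.0f;
    double sd = 0.0;
    for (int i = 0; i < N; ++i)
    {
        sf += 0.1f;
        sd += 0.1;
    }
    // Exact answer is 1.0e6; the float sum drifts noticeably.
    std::printf("float : %.6f\ndouble: %.6f\n", sf, sd);
    return 0;
}
```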