2014 Rice Oil & Gas HPC
Thursday, March 6 • 2:20pm - 2:40pm
Programming Models, Libraries and Tools: Fine Grain MPI, Earl Dodd, University of British Columbia



A major challenge in today’s high performance systems is how to seamlessly bridge between the fine-grain multicore parallelism inside a single processing node and the parallelism available across the nodes of a cluster. In many cases this has led to a hybrid programming approach that combines the Message Passing Interface (MPI) with a finer-grain programming model such as OpenMP. However, the hybrid approach requires supporting both programming models, creates an inflexible boundary between the parts of the program using one model versus the other, and can create conflicts between the two runtime systems.

We present a system called Fine-Grain MPI (FG-MPI) that bridges the gap between multicore and cluster nodes by extending the MPI middleware to support a finer-grain process model: large numbers of concurrent MPI processes inside each node in addition to multiple processes across the nodes. This provides a single, unified process model that can both scale up and scale out without programming changes or rebuilds. FG-MPI extends the MPICH2 runtime to support execution of multiple MPI processes inside a single OS process, essentially decoupling the notion of an MPI process from that of an OS-level process. These are full-fledged MPI processes, and it is possible in FG-MPI to have hundreds or even thousands of them inside a single OS process. As a result, one can develop and execute MPI programs that scale to thousands or millions of MPI processes without requiring a corresponding number of processor cores.

FG-MPI supports function-level parallelism, where an MPI process is bound to a function rather than to a whole program, which brings MPI closer to task-oriented languages (see the sketch below). Expressing function-level concurrency makes it easier to match the parallelism to the problem rather than to the hardware architecture.

Overheads associated with the extra message passing and scheduling of these smaller units of parallelism have been minimized. Context switching among co-located MPI processes in user space is an order of magnitude faster than switching among OS-level processes, and zero-copy communication is supported among co-located MPI processes inside the single address space. The FG-MPI runtime is integrated into the MPICH2 middleware: co-located MPI processes share key structures inside the middleware and cooperatively progress messages for each other. FG-MPI implements an MPI-aware user-level scheduler that works in concert with MPICH2’s progress engine and is responsive to events occurring inside the middleware. For communication efficiency, we exploit the locality of MPI processes in the system and implement optimized communication between co-located processes in the same OS process.

FG-MPI can be viewed as a type of over-subscription (in the SPMD case); however, it is the runtime scheduler that manages this over-subscription, not the OS scheduler. Scheduling of heavy-weight MPI processes by the OS introduces overheads, both because context switches are costly and because the OS scheduler is not aware of the cooperative nature of the communicating processes. In FG-MPI, not only are context switches an order of magnitude cheaper, but it is possible to reduce OS jitter by matching the number of OS processes to the processor cores and scheduling the remaining MPI processes inside those OS processes. Cooperative execution of multiple MPI processes within an OS process adds slackness that is important for latency hiding and helps reduce the idle time that can result from busy polling of the network inside the middleware.
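The following is a minimal sketch of the function-per-rank style described above, written against plain MPI (mpi.h) so it compiles with any MPICH-compatible library. The producer/consumer functions and the rank-to-function mapping are illustrative assumptions, not FG-MPI's actual binding API; FG-MPI provides its own mechanism for binding ranks to functions and a launcher option for co-locating many ranks in one OS process.

/* A minimal sketch (not FG-MPI's API) of binding ranks to small functions.
 * Only standard MPI calls are used, so it compiles with any MPI library;
 * under FG-MPI many of these ranks could be co-located in one OS process. */
#include <mpi.h>
#include <stdio.h>

/* Each "task" is a small function; FG-MPI binds an MPI process to a
 * function like these rather than to a whole program. */
static void producer(int rank, int size)
{
    int value = rank * 100;
    if (rank + 1 < size)                      /* partner exists */
        MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
}

static void consumer(int rank)
{
    int value;
    MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("rank %d received %d from rank %d\n", rank, value, rank - 1);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative rank-to-function mapping: even ranks produce for the
     * odd rank that follows them. With a fine-grain runtime the same SPMD
     * binary can be launched with far more ranks than cores. */
    if (rank % 2 == 0)
        producer(rank, size);
    else
        consumer(rank);

    MPI_Finalize();
    return 0;
}

FG-MPI's launcher reportedly accepts an extra mpiexec option (often cited as -nfg) giving the number of co-located ranks per OS process; that flag is quoted from memory and should be verified against the FG-MPI release rather than taken as definitive here.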
FG-MPI can also improve the performance of existing MPI programs. The added concurrency makes it possible to adjust the unit of computation to better match the cache sizes of the hardware, and it can aid in pipelining several smaller messages, avoiding the rendezvous protocol commonly used for large messages (see the sketch below). The ability to specify finer-grain, task-oriented units of computation also makes it possible to assign them in many different ways to achieve better load balancing. This finer-grain tasking makes it possible to view MPI as a library for concurrent programming rather than simply a communication library for moving data among clusters of nodes; function-level parallelism is closer to the notion of parallelism found in process-oriented programming or in Actor-like systems. In conclusion, FG-MPI provides a better match for today’s multicore processors: a single programming model, suitable for task-oriented programming, that can execute within a single multicore node and across multiple multicore nodes in a cluster.
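As an illustration of the message-pipelining point above, the sketch below splits one large transfer into smaller nonblocking sends. It uses only standard MPI calls; the payload and chunk sizes are placeholders, and whether a given chunk stays on the eager path rather than falling back to the rendezvous protocol depends on the MPI implementation and its configured thresholds.

/* Sketch of pipelining a large payload as many smaller messages.
 * Standard MPI only; the chunk size is a placeholder, and the
 * eager-versus-rendezvous cutoff is implementation dependent. */
#include <mpi.h>
#include <stdlib.h>

#define TOTAL_DOUBLES (1 << 20)   /* whole payload (8 MiB of doubles) */
#define CHUNK_DOUBLES (1 << 13)   /* 64 KiB chunks: placeholder value */
#define NCHUNKS       (TOTAL_DOUBLES / CHUNK_DOUBLES)

int main(int argc, char **argv)
{
    int rank, size;
    double *buf = NULL;
    MPI_Request reqs[NCHUNKS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        buf = calloc(TOTAL_DOUBLES, sizeof *buf);

        for (int i = 0; i < NCHUNKS; i++) {
            double *chunk = buf + (size_t)i * CHUNK_DOUBLES;
            if (rank == 0)        /* sender: one nonblocking send per chunk */
                MPI_Isend(chunk, CHUNK_DOUBLES, MPI_DOUBLE, 1, i,
                          MPI_COMM_WORLD, &reqs[i]);
            else                  /* receiver: matching nonblocking receives */
                MPI_Irecv(chunk, CHUNK_DOUBLES, MPI_DOUBLE, 0, i,
                          MPI_COMM_WORLD, &reqs[i]);
        }
        /* Chunks complete independently, so communication of early chunks
         * can overlap with posting (or computing) the later ones. */
        MPI_Waitall(NCHUNKS, reqs, MPI_STATUSES_IGNORE);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}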

Moderators
Amik St-Cyr

Senior researcher, Shell
Amik St-Cyr recently joined Royal Dutch Shell as a senior researcher in computation & modeling. Amik came to industry from the NSF-funded National Center for Atmospheric Research (NCAR). His work focused on the discovery of novel numerical methods for geophysical...

Speakers
Earl J. Dodd

Chief Strategy Officer, Scalable Analytics Inc.


Thursday March 6, 2014 2:20pm - 2:40pm PST
BRC 280, Rice University, 6500 Main Street at University, Houston, TX 77030
