|Name:||Tutorial 4: Performance-Oriented Programming on Multicore-Based Clusters with MPI, OpenMP & Hybrid MPI/OpenMP|
|Time:||Sunday, June 17, 2012, 9:00 AM - 1:00 PM|
|Location:||Hamburg University, East Wing|
|Speakers:||Georg Hager, Erlangen Regional Computing Center|
|Gabriele Jost, AMD|
|Rolf Rabenseifner, HLRS Stuttgart|
|Jan Treibig, Erlangen Regional Computing Center|
|Abstract:||Most HPC systems are clusters of multicore, multisocket nodes. These systems are highly hierarchical, and several programming models can be used on them; the most popular are shared-memory parallel programming with OpenMP within a node, distributed-memory parallel programming with MPI across the cores of the cluster, and a combination of both. Obtaining good performance with any of these models requires considerable knowledge about the system architecture and the requirements of the application. The goal of this tutorial is to provide insights into performance limitations, together with guidelines for program optimization techniques on all levels of the hierarchy, when using pure MPI, pure OpenMP, or a combination of both.
We cover architectural peculiarities such as shared vs. separate caches, bandwidth bottlenecks, and ccNUMA locality. Typical performance characteristics such as synchronization overhead, intranode MPI bandwidth and latency, and bandwidth saturation (in cache and memory) are discussed in order to pinpoint the influence of system topology and thread affinity on the performance of parallel programming constructs. Techniques and tools for establishing process/thread placement and for measuring performance metrics are demonstrated in detail. We also analyze the strengths and weaknesses of various hybrid MPI/OpenMP programming strategies. Benchmark results and case studies on several platforms are presented.|