The 118th AICS Cafe
Joint seminar (Large-scale Parallel Numerical Computing Technology Research Team)
Date and Time: Wed. Aug. 2, 2017, 14:00-16:30
Place: Workshop room (6th floor) at AICS
Keita Teranishi (Sandia National Laboratories, California)
Balazs Gerofi (System Software Research Team)
Miwako Tsuji (Programming Environment Research Team)
Presentation Language: English
Presentation Material: English
【14:00-15:00】Keita Teranishi (Sandia National Laboratories, California)
Title: Toward Resilient Asynchronous Many Task Programming
Abstract: As semiconductor technology reaches its physical limits, the performance improvement of high performance computing systems no longer follows the predictions of Moore’s law. One viable approach to addressing this stagnation is to relax the reliability of computing systems and leave application users to manage it. To enable this idea, it is essential for the programming model to embrace a resilience capability. Today, the major resilience framework is coordinated checkpoint and restart (C/R), which involves global coordination of processes and threads to obtain a consistent global application state. However, this global recovery model entails inherent scalability issues and a disproportionate use of resources in responding to local failures. These issues are better handled through the asynchronous many-task (AMT) programming model, which is intended to derive good scalability from the unprecedented parallelism and node-architecture complexity of future HPC systems. A runtime system with AMT provides abstractions for encapsulating streams of program execution (tasks) and organizing the application data as objects rather than as a sequence of data mapped to the system address space. In the AMT model, resilience is achieved through task re-execution and replication, facilitated by versioning and replication of data objects. However, extensive research on task-based resilience is still required to determine the roadmap of resilience in the context of the programming environment. We will discuss our ongoing activities on the resilience of the high performance AMT programming model and the challenges for scalable HPC application resilience.
【15:30-16:00】Miwako Tsuji (Programming Environment Research Team)
Title: Fault tolerance features in an XMP-YML scientific workflow programming model
Abstract: Supercomputers in the exa-scale era are expected to consist of a huge number of nodes arranged in a multi-level hierarchy. Exploiting such systems poses many important challenges, such as scalability, programmability, and reliability. In this talk, we focus on the scalability, programmability, and fault tolerance features of a multi-SPMD programming model. We have developed a development and execution environment based on workflow and PGAS (Partitioned Global Address Space). We have extended the environment by incorporating a fault-resilient scheduling policy into the workflow scheduler.
【16:00-16:30】Balazs Gerofi (System Software Research Team)
Title: IHK/McKernel: A Lightweight Multi-kernel based Operating System for Extreme Scale Supercomputing
Abstract: RIKEN Advanced Institute for Computational Science leads the development of Japan's next generation flagship supercomputer, the successor of the K Computer. Part of this effort is to design and develop a system software stack that suits the needs of future extreme scale computing. In this talk, we focus on operating system research and discuss IHK/McKernel, our multi-kernel based operating system framework. IHK/McKernel runs Linux and a light-weight kernel side-by-side on compute nodes, with the primary motivation of providing scalable, consistent performance for large scale HPC simulations while retaining a fully Linux compatible execution environment. We present an overview of the system architecture, provide preliminary results on up to two thousand Intel Xeon Phi nodes, and outline future research directions.