Dynamic Management of Supercomputer Resources
Job scheduling and resource management play an essential role in high-performance computing. The resources of a supercomputer are usually managed by a batch system, which is responsible for mapping jobs effectively onto resources (in most cases, compute nodes). From the system perspective, a batch system must ensure high system utilization and throughput; from the user perspective, it must ensure fast response times and fairness when allocating resources across jobs. Traditional batch systems perform only static resource management, that is, they support only jobs whose resource requirements remain fixed over their entire life cycle. This is no longer sufficient, because HPC applications now often exhibit unpredictably changing resource requirements, for example, to accommodate expanding data structures or to enable in-situ analyses for visualization. In general, changing requirements may involve a wide range of resource types, including compute nodes and different classes of storage. Moreover, runtime systems are becoming more adaptive in order to lower energy consumption and support fault tolerance. To allow the system to modify the resource set assigned to a job as soon as the need arises, we develop dynamic resource management techniques for batch systems, which we believe will be indispensable in the future.
- Suraj Prabhakaran, Marcel Neumann, Felix Wolf: Efficient Fault Tolerance through Dynamic Node Replacement. In Proc. of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Washington, DC, USA, May 2018. [PDF] [BibTeX]
- Suraj Prabhakaran, Marcel Neumann, Sebastian Rinke, Felix Wolf, Abhishek Gupta, Laxmikant V. Kalé: A Batch System with Efficient Scheduling for Malleable and Evolving Applications. In Proc. of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Hyderabad, India, pages 429-438, IEEE Computer Society, May 2015. [PDF] [URL] [DOI] [BibTeX]