Scheduling of supercomputer resources
Job scheduling and resource management play an essential role in high-performance computing. The resources of a cluster are usually managed by a batch system, which is responsible for mapping jobs effectively onto resources (in most cases, compute nodes). From the system perspective, a batch system must ensure high utilization and throughput; from the user perspective, it must ensure fast response times and fairness when allocating resources across jobs. In the recent past, our research in this area focused on dynamic scheduling techniques that (i) respond to sudden changes in a job’s resource requirements at runtime, for example, to accommodate expanding data structures or enable in-situ analyses for visualization, and (ii) allow the scheduler to modify the resource set assigned to a job to improve system throughput or to deal with sudden node failures. Currently, we are working on scheduling algorithms for the co-allocation of compute nodes and fast but limited storage resources such as NVMe.
- Leah E. Lackner, Hamid Mohammadi Fard, Felix Wolf: Efficient Job Scheduling for Clusters with Shared Tiered Storage. In Proc. of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Larnaca, Cyprus, pages 321-330, IEEE, May 2019. [PDF] [DOI] [BibTeX]
- Suraj Prabhakaran, Marcel Neumann, Felix Wolf: Efficient Fault Tolerance through Dynamic Node Replacement. In Proc. of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Washington, DC, USA, May 2018. [PDF] [BibTeX]
- Suraj Prabhakaran, Marcel Neumann, Sebastian Rinke, Felix Wolf, Abhishek Gupta, Laxmikant V. Kalé: A Batch System with Efficient Scheduling for Malleable and Evolving Applications. In Proc. of the 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Hyderabad, India, pages 429-438, IEEE Computer Society, May 2015. [PDF] [URL] [DOI] [BibTeX]