Spectral is a software library that enables use of the Summit burst-buffer I/O system without the need for code modification. Using function-call interception, Spectral hooks into an application and detects when files are written to specially configured directories. Upon detection, Spectral schedules and manages asynchronous drains from the burst buffer to the parallel file system. Spectral has been successfully tested with the Fortran GTC code and LAMMPS on OLCF Power clusters. In both applications, Spectral significantly increased checkpointing performance while requiring no modifications to the application source code.
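Spectral's internals are not reproduced in this summary; the following is a minimal sketch of the function-call interception technique described above, not Spectral's actual implementation. It assumes an LD_PRELOAD-style shim and a hypothetical watched directory /mnt/bb/; the shim wraps open() and, in a real library, would hand files created under the watched prefix to a drain scheduler.

    /* shim_open.c: minimal sketch of LD_PRELOAD function interception.
     * Build: gcc -shared -fPIC shim_open.c -o shim_open.so -ldl
     * Run:   LD_PRELOAD=./shim_open.so ./app
     * The watched prefix and the notification are illustrative only. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    #define WATCHED_PREFIX "/mnt/bb/"   /* hypothetical burst-buffer mount */

    typedef int (*open_fn)(const char *, int, ...);

    int open(const char *path, int flags, ...)
    {
        static open_fn real_open;
        mode_t mode = 0;

        if (!real_open)
            real_open = (open_fn)dlsym(RTLD_NEXT, "open");

        if (flags & O_CREAT) {      /* mode argument exists only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        /* Detect writes destined for the watched directory. */
        if ((flags & (O_WRONLY | O_RDWR)) &&
            strncmp(path, WATCHED_PREFIX, strlen(WATCHED_PREFIX)) == 0) {
            /* A real library would enqueue this file with a drain scheduler
             * that later copies it to the parallel file system. */
            fprintf(stderr, "[shim] tracking %s for asynchronous drain\n", path);
        }

        return real_open(path, flags, mode);
    }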
Contributors: Christopher Zimmer and Scott Atchley
With the leveling off of processor clock speeds, chip manufacturers have increased the number of cores to consume the additional transistors promised by Moore's Law. As we move toward exascale systems, we may see higher core counts per node, including more cores per socket, more sockets, and add-in boards such as GP-GPUs and many-core coprocessors connected via PCI Express. As users initially port their applications to Titan's GP-GPUs, they may not fully utilize the CPUs, leaving them available for functional partitioning.
This effort will explore providing runtime services to applications using a small subset of cores on many-core systems by developing a Functional Partitioning (FP) runtime environment. This environment will partition a many-core node such that an end-to-end application (simulation plus data-analysis tasks) can be scheduled in situ, on the same node, alongside the application's simulation job for better end-to-end performance.
Services provided may include I/O buffering, which would allow the application to resume computation more quickly, or various forms of fine-grain analysis or transformation.
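As a concrete illustration of partitioning a node, the sketch below pins a process to a reserved core subset using Linux's sched_setaffinity(2). The core IDs are hypothetical, and an FP runtime would additionally pin the simulation to the complementary core set; this is not the FP runtime itself.

    /* fp_pin.c: sketch of reserving a core subset for an in-situ analysis
     * task on a Linux node.  Core IDs are illustrative assumptions. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Reserve cores 14 and 15 (illustrative) for the analysis task,
         * leaving cores 0-13 to the simulation. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(14, &set);
        CPU_SET(15, &set);

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        /* From here on, this process (e.g., an I/O buffering or analysis
         * service) competes only for the reserved cores. */
        printf("analysis service pinned to cores 14-15, pid %d\n",
               (int)getpid());
        return EXIT_SUCCESS;
    }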
Contributors: Scott Atchley, Saurabh Gupta, Ross Miller, and Sudharshan Vazhkudai
To minimize the runtime variability of jobs and reduce node-allocation fragmentation, we developed a dual-ended scheduling policy for the Moab scheduler on Titan, replacing the first-fit policy. Our algorithm schedules large jobs top-down and small jobs bottom-up, thereby minimizing the fragmentation of large, long-running jobs that is caused by small, short-lived jobs. We also modified the assignment of nodes to a job to prioritize the z dimension (nodes within a rack), followed by the x dimension (nodes in racks in the same row, which offer better communication bandwidth) and then the y dimension (columns).
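The production policy lives inside Moab and is not reproduced here; the sketch below illustrates only the dual-ended idea under simplifying assumptions: a node list already ordered by z, then x, then y, and an illustrative size cutoff for what counts as a large job.

    /* dual_ended.c: sketch of dual-ended allocation over an ordered node
     * list.  The node count, ordering, and size threshold are assumptions
     * for illustration; the deployed policy is implemented in Moab. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NNODES 16
    #define LARGE_JOB_THRESHOLD 4    /* illustrative cutoff, in nodes */

    static bool busy[NNODES];        /* nodes assumed pre-sorted by z, x, y */

    /* Allocate `count` free nodes: large jobs scan top-down, small jobs
     * bottom-up, so short-lived jobs do not fragment the large-job end.
     * Returns the number of nodes actually allocated. */
    static int allocate(int count, int *out)
    {
        int got = 0;
        bool top_down = count >= LARGE_JOB_THRESHOLD;

        for (int i = 0; i < NNODES && got < count; i++) {
            int n = top_down ? i : NNODES - 1 - i;
            if (!busy[n]) {
                busy[n] = true;
                out[got++] = n;
            }
        }
        return got;
    }

    int main(void)
    {
        int nodes[NNODES];
        int got = allocate(6, nodes);  /* large job: fills from the top */
        printf("large job got %d nodes starting at node %d\n", got, nodes[0]);
        got = allocate(2, nodes);      /* small job: fills from the bottom */
        printf("small job got %d nodes starting at node %d\n", got, nodes[0]);
        return 0;
    }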
Thousands of jobs are benefitting from this technique. This project received a Significant Event Award, a prestigious lab-wide recognition.
Contributors: Chris Zimmer, Scott Atchley, Sudharshan Vazhkudai
This project analyzes the temporal failure characteristics of Titan's 299,008 CPU cores and 18,688 GPUs to understand trends in machine failures, MTBF, single-bit errors, double-bit errors, off-the-bus errors, and the correlation of temperature with failure [TitanGPUReliability:HPCA15]. This study was the first of its kind for a large-scale GPU deployment. Based on these insights, we devised checkpointing advisory tools [LazyChkpt:DSN14] that are saving millions of core hours for production jobs, devised an energy-based scheduling algorithm that places large node-count GPU jobs, which stress the GPUs more, at the bottom of the rack, where it is much cooler than the top, and cordoned off nodes with frequent failures.
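The advisory tools themselves are described in [LazyChkpt:DSN14] and are not reproduced here. As an illustration of the kind of calculation such a tool performs, the sketch below evaluates Young's well-known first-order optimal checkpoint interval, t_opt = sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint; the numeric inputs are placeholders, not Titan measurements.

    /* chkpt_interval.c: sketch of a checkpoint-interval advisory based on
     * Young's first-order approximation.  Build with: gcc ... -lm
     * The MTBF and checkpoint cost are placeholder values. */
    #include <math.h>
    #include <stdio.h>

    /* Optimal interval (seconds) between checkpoints, given the time to
     * write one checkpoint and the system mean time between failures. */
    static double young_interval(double chkpt_cost_s, double mtbf_s)
    {
        return sqrt(2.0 * chkpt_cost_s * mtbf_s);
    }

    int main(void)
    {
        double cost = 300.0;          /* 5 min per checkpoint (assumed) */
        double mtbf = 8.0 * 3600.0;   /* 8 h system MTBF (assumed)      */
        double t = young_interval(cost, mtbf);
        printf("advised checkpoint interval: %.0f s (%.1f h)\n",
               t, t / 3600.0);
        return 0;
    }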
TechInt contributors: Devesh Tiwari, Sudharshan Vazhkudai
High-bandwidth file systems are critical to today's supercomputing applications. To achieve the level of performance that leadership-class applications require, the underlying network must sustain high aggregate-bandwidth demands. Unfortunately, in such a large-scale network, congestion at routers limits overall I/O performance and causes high variability.
In this effort, our goal was to identify and understand bottlenecks in the interconnection network pertaining to file-system I/O traffic. This work involved analyzing the impact of job placement and router placement on performance, and studying how these configurations play a role in reducing congestion in the interconnection network.
As a result of this research, we developed and deployed a lightweight, scalable mechanism to monitor router traffic on Titan's interconnect fabric (9,600 routers). The tool is used to analyze interference among jobs.
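The deployed monitor's interfaces are not part of this summary; the sketch below shows only the general sampling pattern under stated assumptions. read_router_counter() is a hypothetical stand-in for the fabric's counter interface (simulated here with random increments), and the sampling period and congestion threshold are illustrative.

    /* router_watch.c: sketch of lightweight router-traffic monitoring:
     * periodically sample per-router counters, compute per-interval
     * deltas, and flag routers whose traffic exceeds a threshold. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NROUTERS 9600
    #define SAMPLE_PERIOD_S 10
    #define NSAMPLES 6
    #define CONGESTION_THRESHOLD 1500000u  /* per-interval units, assumed */

    static uint64_t counters[NROUTERS];

    /* Hypothetical stand-in for the fabric's per-router traffic counters;
     * here it merely simulates monotonically increasing values. */
    static uint64_t read_router_counter(int router)
    {
        counters[router] += (uint64_t)(rand() % 2000000);
        return counters[router];
    }

    int main(void)
    {
        static uint64_t prev[NROUTERS];

        for (int r = 0; r < NROUTERS; r++)
            prev[r] = read_router_counter(r);   /* baseline sample */

        for (int s = 0; s < NSAMPLES; s++) {
            sleep(SAMPLE_PERIOD_S);
            for (int r = 0; r < NROUTERS; r++) {
                uint64_t now = read_router_counter(r);
                uint64_t delta = now - prev[r];
                prev[r] = now;
                if (delta > CONGESTION_THRESHOLD)
                    printf("sample %d: router %d congested (%llu units)\n",
                           s, r, (unsigned long long)delta);
            }
        }
        return 0;
    }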
Contributors: Saurabh Gupta, Chris Zimmer, and Sudharshan Vazhkudai