Technology Integration

These projects relate to innovations in system architecture that aim to improve performance, reduce variability, and enhance reliability/resilience. They may also involve evaluation of new processor technologies as applicable to the OLCF roadmap.

Spectral

Spectral is a software library that enables use of the Summit burst-buffer I/O system without requiring code modification. Using function-call interception techniques, Spectral hooks into an application and detects when files are written to specially configured directories. Upon detection, Spectral schedules and manages asynchronous drains from the burst buffer to the parallel file system. Spectral has been successfully tested with Fortran GTC and LAMMPS on OLCF Power clusters. In both applications, Spectral significantly increased checkpointing performance while requiring no modifications to the application source code.
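Spectral itself performs the interception in compiled code (e.g., wrappers around file-close calls), but the drain-scheduling idea can be sketched in a few lines. The sketch below is illustrative only: the `DrainAgent` class, its method names, and the paths are hypothetical, and the explicit `file_closed` notification stands in for the real library's transparent function-call hook.

```python
# Minimal sketch of Spectral-style asynchronous draining: the application
# writes checkpoints to a node-local burst-buffer directory, and a background
# agent copies ("drains") each completed file to the parallel file system so
# the application can resume computation without waiting on the slower tier.
# All names and paths are illustrative; real Spectral intercepts file-close
# calls rather than being notified explicitly.
import shutil
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

class DrainAgent:
    def __init__(self, bb_dir: str, pfs_dir: str, workers: int = 2):
        self.bb_dir = Path(bb_dir)
        self.pfs_dir = Path(pfs_dir)
        self.pfs_dir.mkdir(parents=True, exist_ok=True)
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.pending = []

    def file_closed(self, path: Path):
        # Called when a checkpoint file in the watched directory is complete;
        # the drain is scheduled in the background and returns immediately.
        self.pending.append(self.pool.submit(self._drain, path))

    def _drain(self, path: Path):
        dest = self.pfs_dir / path.name
        shutil.copy2(path, dest)  # burst buffer -> parallel file system
        return dest

    def wait_all(self):
        # Barrier used at job end to guarantee all drains have landed.
        return [f.result() for f in self.pending]
```

The key property matching the description above is that `file_closed` never blocks: the application's checkpoint completes at burst-buffer speed while the copy to the parallel file system proceeds asynchronously.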

Contributors: Christopher Zimmer and Scott Atchley

Status: Ongoing

Functional Partitioning Runtime for Many-Core Nodes

With processor clock speeds leveling off, chip manufacturers have increased core counts to consume the additional transistors promised by Moore's Law. As we move toward exascale systems, we may see higher core counts per node, including more cores per socket, more sockets, and add-in boards such as GP-GPUs and many-core coprocessors connected via PCI Express. As users initially port their applications to Titan's GP-GPUs, they may not fully utilize the CPUs, leaving those cores available for functional partitioning.

This effort will explore providing runtime services to applications using a small subset of cores on a many-core system by developing a Functional Partitioning (FP) runtime environment. This environment will partition a many-core node so that an end-to-end application (simulation + data-analysis tasks) can be scheduled in situ, with the analysis tasks running on the same node alongside the simulation job for better end-to-end performance.

Services provided may include I/O buffering, which allows the application to resume computation more quickly, as well as various forms of fine-grained analysis or transformation.
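The core-partitioning idea above can be sketched with a small "service partition" that absorbs I/O and analysis tasks so the simulation resumes computing immediately. This is a hedged illustration, not the FP runtime itself: the `ServicePartition` class and task kinds are hypothetical, and a real implementation would pin the service to reserved cores (e.g., via `os.sched_setaffinity` on Linux), a step omitted here for portability.

```python
# Sketch of functional partitioning: the simulation runs on most cores while
# a small service partition (here, one worker thread) handles I/O buffering
# and fine-grained analysis off the critical path. Names are illustrative.
import queue
import threading

class ServicePartition:
    def __init__(self):
        self.tasks = queue.Queue()
        self.results = []
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, kind, payload):
        # Non-blocking handoff: the simulation enqueues work and returns
        # immediately to computation.
        self.tasks.put((kind, payload))

    def _run(self):
        while True:
            kind, payload = self.tasks.get()
            if kind == "stop":
                return
            if kind == "reduce":        # fine-grained in-situ analysis
                self.results.append(sum(payload) / len(payload))
            elif kind == "buffer_io":   # placeholder for a buffered write
                self.results.append(len(payload))
            self.tasks.task_done()

    def shutdown(self):
        self.tasks.put(("stop", None))
        self.worker.join()
```

The design choice illustrated is the one the paragraph describes: the simulation's only cost per service request is a queue insertion, while the drain to storage or the analysis kernel runs concurrently on the service partition.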

Contributors: Scott Atchley, Saurabh Gupta, Ross Miller, and Sudharshan Vazhkudai

Status: Ongoing

Selected publications:

Top and Bottom Scheduling

To minimize the runtime variability of jobs and reduce node-allocation fragmentation, we developed a dual-ended scheduling policy for the Moab scheduler on Titan, replacing the default first-fit policy. Our algorithm schedules large jobs top down and small jobs bottom up, thereby minimizing the fragmentation of large, long-running jobs caused by small, short-lived jobs. We also modified how nodes are assigned to a job, prioritizing the z dimension (nodes within a rack), then the x dimension (nodes in racks in the same row, which offer better communication bandwidth), and finally the y dimension (columns).
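The dual-ended idea can be sketched over a one-dimensional node range. This is a simplified illustration under stated assumptions, not the Moab policy itself: the size threshold, node count, and function names are hypothetical, and the real scheduler additionally applies the z/x/y dimension ordering described above.

```python
# Sketch of dual-ended ("top and bottom") scheduling: jobs at or above a
# size threshold are packed from the top of the node range downward, small
# jobs from the bottom upward, so short-lived small jobs do not fragment
# the contiguous space that large jobs need. Threshold is illustrative.
def schedule(jobs, num_nodes, large_threshold=16):
    """jobs: list of (job_id, size); returns {job_id: (first_node, last_node)}."""
    bottom, top = 0, num_nodes - 1
    placement = {}
    for job_id, size in jobs:
        if top - bottom + 1 < size:
            raise RuntimeError(f"not enough free nodes for {job_id}")
        if size >= large_threshold:   # large job: allocate from the top
            placement[job_id] = (top - size + 1, top)
            top -= size
        else:                         # small job: allocate from the bottom
            placement[job_id] = (bottom, bottom + size - 1)
            bottom += size
    return placement
```

Under first-fit, small jobs would land in front of large ones and leave holes in the middle of the machine when they finish; here they are confined to the bottom of the range, keeping the top contiguous for large jobs.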

Thousands of jobs have benefited from this technique. The project received a Significant Event Award, a prestigious lab-wide recognition.

Contributors: Chris Zimmer, Scott Atchley, Sudharshan Vazhkudai

Status: Complete

Selected publications:

Supercomputer Failure Analysis

This project analyzed the temporal failure characteristics of Titan’s 299,008 CPUs and 18,688 GPUs to understand trends in machine failures, MTBF, single-bit errors, double-bit errors, off-the-bus errors, and the correlation of temperature with failures [TitanGPUReliability:HPCA15]. This study was the first of its kind for a large-scale GPU deployment. Based on these insights, we devised checkpointing advisory tools [LazyChkpt:DSN14] that are saving millions of core hours for production jobs; devised an energy-based scheduling algorithm that places large node-count GPU jobs, which stress the GPUs more, at the bottom of the rack, where it is much cooler than at the top; and cordoned off nodes with frequent failures.
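The checkpointing advisory work rests on simple reliability arithmetic, which can be sketched as follows. The numbers and function names below are illustrative only; the actual study derives per-component failure rates from Titan's logs. The interval formula is Young's well-known approximation for the optimal checkpoint period.

```python
# Sketch of the reliability math behind a checkpoint advisory: MTBF is the
# observation window divided by the failure count, and Young's approximation
# gives a near-optimal checkpoint interval sqrt(2 * C * MTBF), where C is
# the cost of writing one checkpoint. All inputs here are illustrative.
import math

def mtbf(failure_times, window_hours):
    """Mean time between failures over an observation window (hours)."""
    return window_hours / len(failure_times)

def young_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation for the optimal checkpoint interval (hours)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)
```

The advisory intuition follows directly: a shorter MTBF or cheaper checkpoints push the optimal interval down, while a reliable machine with expensive checkpoints favors checkpointing lazily.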

TechInt contributors: Devesh Tiwari, Sudharshan Vazhkudai

Status: Complete

Selected publications:

Interconnect Congestion Analysis

High-bandwidth file systems are critical to today’s supercomputing applications. To achieve the performance required by leadership-class applications, the underlying network must sustain high aggregate-bandwidth demands. Unfortunately, in such a large-scale network, congestion at routers limits overall I/O performance and causes high variability.

In this effort, our goal was to identify and understand bottlenecks in the interconnection network related to file-system I/O traffic. This work involved analyzing the impact of job placement and router placement on performance, and studying how these configurations help reduce congestion in the interconnection network. During the course of this investigation we sought to:

  1. Understand the dynamic behavior of the system via careful experimentation and investigation, and identify where bottlenecks occur,
  2. Develop simulation models of the network for an in-depth study of network dynamics in the presence of real-life workload scenarios,
  3. Identify heuristics or layouts that improve upon the existing topology, and
  4. Derive optimal I/O router layouts for the current topology and for several candidate topologies for future systems.

As a result of this research, we developed and deployed a lightweight, scalable mechanism to monitor the router traffic on Titan’s interconnect fabric (9600 routers). The tool is used to analyze interference among jobs.
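The analysis enabled by such router-level monitoring can be sketched as a hotspot detector over sampled counters. This is a generic illustration, not the deployed tool: the counter names, sampling scheme, and threshold below are hypothetical rather than the actual Titan telemetry interface.

```python
# Sketch of congestion hotspot detection from per-router traffic counters:
# given two samples of cumulative (stalls, flits) counters per router, flag
# routers whose stall ratio over the interval exceeds a threshold. Counter
# names and the threshold are illustrative, not actual Titan telemetry.
def hotspots(sample_a, sample_b, stall_ratio_threshold=0.3):
    """Samples map router_id -> (stalls, flits); returns congested router ids."""
    congested = []
    for router, (s0, f0) in sample_a.items():
        s1, f1 = sample_b[router]
        flits = f1 - f0
        if flits > 0 and (s1 - s0) / flits > stall_ratio_threshold:
            congested.append(router)
    return congested
```

Correlating flagged routers with the placement of concurrently running jobs is what makes it possible to attribute I/O variability to interference between specific jobs, as described above.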

Contributors: Saurabh Gupta, Chris Zimmer, and Sudharshan Vazhkudai

Status: Ongoing
