Focus Areas


We solve diverse technical challenges at the OLCF, with particular focus on the following areas.

System Architecture

These projects relate to innovations in system architecture that aim to improve performance, reduce variability, and enhance reliability and resilience. They may also involve evaluation of new processor technologies as applicable to the OLCF roadmap. The Technology Integration Group’s networking projects focus on data movement at all scales and levels of abstraction – among processors within a supercomputer, among computers across continents, and across diverse hardware transport mechanisms using a variety of protocol stacks.

Data Science

The OLCF generates a vast amount of system event data from some of the world’s largest supercomputers, file systems, and user interactions with these systems. Analyzing and understanding these data artifacts is crucial for preventing system failures, providing an uninterrupted service experience to system users, and informing the design of future generations of supercomputers.

File and Storage Systems

The projects on this page explore extreme scale storage in terms of bandwidth, capacity, performance, reliability, scalability, and usability. Some projects are local and specific to OLCF while others involve collaboration with diverse organizations and individuals to advance the state of the art. These projects exemplify the Technology Integration Group’s contributions to data management and analysis, including data modeling, data capture, publication, search and discovery, analysis, and visualization.

HPSS Archival Storage

High Performance Storage System (HPSS) is the result of over two decades of collaboration among five Department of Energy laboratories and IBM, with significant contributions by universities and other laboratories worldwide.
Technology Integration members develop and maintain components of HPSS responsible for the administrative interfaces, low level data management, and logging.

Projects


Technology Integration Group is leading key R&D activities in OLCF.

The Accelerated Data Analytics and Computing Institute has been established to explore potential future collaboration among UT-Battelle, LLC (UT-Battelle), the Swiss Federal Institute of Technology, Zurich (Eidgenössische Technische Hochschule Zürich/ ETH Zurich), and Tokyo Institute of Technology (Tokyo Tech). Consistent with their respective missions, the Participants seek to collaborate and leverage their respective investments in application software readiness in order to expand the breadth of applications capable of running on accelerated architectures. All three organizations manage HPC centers that run large, GPU-accelerated supercomputers and provide key HPC capabilities to academia, government, and industry to solve many of the world’s most complex and pressing scientific problems.

Scott Atchley

Constellation is a digital object identifier (DOI) based science network for supercomputing data. Constellation makes it possible for OLCF researchers to obtain DOIs for large data collections by tying them together with the associated resources and processes that went into the production of the data (e.g., jobs, collaborators, projects), using a scalable database. It also allows scientific workflows to be annotated with rich metadata, and enables the cataloging and publishing of the artifacts for open access, aiding in scalable data discovery. OLCF users can use the DOI service to publish datasets even before the associated paper is published, and to retain key data even after project expiration. From a center standpoint, DOIs enable data stewardship and better management of scratch and archival storage.

The Constellation web portal can be accessed at https://doi.ccs.ornl.gov.

Mitch Griffith

Ross Miller

Sly Silviu-Cristian Dumitru

Hyogi Sim

Data Jockey is a workflow-aware data management service that helps users automate the orchestration of data movement and placement across multiple storage tiers in a batch-oriented HPC environment like the OLCF.

The goal is to offload from our users the complexity of data preparation and lifetime management across many storage tiers (tape archives, object stores, parallel file systems, node-local NVRAM, and more to come).

Data Jockey achieves this goal as an overarching layer on top of the existing storage infrastructure, exposing an abstract environment for policy driven data movement, placement, and access that unifies data management while hiding the complexity of having an increasing number of storage tiers.

The service is architected as a centralized, managed control plane for user-specific scientific workflows, capable of reaching out and orchestrating heterogeneous data management resources such as data stores, movers, and interfaces on behalf of users.
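As a rough illustration of the policy-driven placement idea described above, the sketch below maps a dataset's lifecycle state to a storage tier. The tier names, dataset fields, and policy thresholds are hypothetical and are not Data Jockey's actual interface or configuration format.

    # Hypothetical sketch of policy-driven data placement across storage tiers.
    # Tier names and the policy are illustrative only, not Data Jockey's API.
    from dataclasses import dataclass

    @dataclass
    class Dataset:
        name: str
        size_gb: float
        days_since_last_access: int
        needed_by_running_job: bool

    def place(ds: Dataset) -> str:
        """Pick a target tier for a dataset based on a simple lifecycle policy."""
        if ds.needed_by_running_job:
            return "nvram" if ds.size_gb < 100 else "parallel_fs"
        if ds.days_since_last_access < 30:
            return "parallel_fs"
        if ds.days_since_last_access < 180:
            return "object_store"
        return "tape_archive"

    # Example: a 50 GB dataset untouched for 200 days would be staged to tape.
    print(place(Dataset("sim_output_042", 50, 200, False)))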

Chris Brumgard

Bing Xie

Sarp Oral

GUIDE is a framework used to collect, federate, and analyze log data from the OLCF, and to derive insights into facility operations based on that data. GUIDE collects system logs and extracts monitoring data at every level of the various OLCF subsystems, and applies a suite of pre-processing tools to make the raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, which offers storage, indexing, querying, and visualization capabilities. The GUIDE framework further supports a set of tools to analyze these multiple disparate log streams in concert to derive operational insights. The system has been in operation for over two years in the production OLCF environment.

The main user interface to GUIDE is a set of Splunk dashboards available at https://guide.ccs.ornl.gov.
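To illustrate the kind of pre-processing step mentioned above, the sketch below turns raw syslog-style lines into structured records suitable for ingestion. The line format and field names are assumptions for illustration; GUIDE's actual tooling is not shown here.

    # Hypothetical sketch: parse raw syslog-style lines into structured records
    # before ingestion into a data warehouse. Field names are illustrative only.
    import json
    import re

    LINE_RE = re.compile(
        r"^(?P<timestamp>\S+ \S+) (?P<host>\S+) (?P<component>\S+): (?P<message>.*)$"
    )

    def parse_line(line: str):
        """Return a dict for a well-formed line, or None for unparseable noise."""
        match = LINE_RE.match(line.strip())
        return match.groupdict() if match else None

    raw = "2018-03-01 12:00:05 nid00042 kernel: machine check event logged"
    record = parse_line(raw)
    if record is not None:
        print(json.dumps(record))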

Ross Miller

Chris Zimmer

High Performance Storage System (HPSS) is the result of over two decades of collaboration among five Department of Energy laboratories and IBM, with significant contributions by universities and other laboratories worldwide.

HPSS can manage petabytes of data on disk and robotic tape libraries, providing highly flexible and scalable hierarchical storage management that keeps recently used data on disk and less recently used data on tape. Through the use of cluster, LAN and/or SAN technology, HPSS aggregates the capacity and performance of many computers, disks, and tape drives into a single virtual archival system of exceptional size and versatility. This approach enables HPSS to easily meet otherwise unachievable demands of total storage capacity, file sizes, data rates, and number of objects stored.

Technology Integration members develop and maintain components of HPSS responsible for the administrative interfaces, low level data management, and logging.

Vicky White

Mitch Griffith

Adam Disney

Spider PFS Metadata Snapshot Capture and Analysis:
  1. LustreDU: Development of a scalable tool to capture daily snapshots of the 1-billion-entry Spider PFS metadata. The snapshots contain valuable information such as file paths, last modification and access times, and owner and group information. Snapshots have been captured for the past three years (each snapshot is ~119 GB).

  2. Snapshot Analysis: Analyzing the daily metadata snapshots in aggregate can provide deep insight into the temporal evolution of how the PFS is used by the various science projects. To this end, we (a) created a Spark-based distributed analysis framework to analyze the ~127 TB of data, and (b) analyzed the data to study the temporal evolution of the file system, glean insights into project/user behavior and their file characteristics, and understand sharing between users and projects. The analysis covered multiple dimensions: (1) how projects use the file system (directory depth, files per directory, wide striping, burstiness); (2) file characteristics (popular file types, age); (3) behavior analysis (e.g., how long after creation are files still accessed?); and (4) sharing analysis of projects and users (does it follow a power-law graph or small-world pattern, and what is the diameter of the connected components?). The analysis has already provided rich information for the design of future storage systems.
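As a rough sketch of this style of analysis, the PySpark example below computes per-project file-age statistics from one daily snapshot. The snapshot layout and column names are assumptions for illustration, not the actual LustreDU output format.

    # Hypothetical PySpark sketch: summarize file age per project from a daily
    # metadata snapshot. Columns and storage layout are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("snapshot-analysis").getOrCreate()

    # Assume each snapshot is stored as Parquet with one row per file entry.
    snap = spark.read.parquet("/snapshots/2017-06-01.parquet")

    age_by_project = (
        snap.withColumn("age_days",
                        F.datediff(F.lit("2017-06-01"), F.col("mtime")))
            .groupBy("project")
            .agg(F.count("*").alias("files"),
                 F.avg("age_days").alias("avg_age_days"),
                 F.expr("percentile_approx(age_days, 0.5)").alias("median_age_days"))
    )
    age_by_project.show()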

PFS Tools:

Development of highly parallel tools for standard file system operations. These tools are used to operate on petabytes of data and one billion files, and have been deployed at other sites.

I/O Signature Extraction:

Automatic extraction of application I/O signatures from noisy Spider storage backend I/O logs, using a rich suite of statistical techniques. Unlike extant solutions, this work does not require application instrumentation to obtain the signatures.
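As a loose illustration of the general idea, the sketch below recovers candidate application write bursts from a noisy per-interval throughput series by smoothing and thresholding. This is a stand-in for the actual statistical suite, and the log representation is an assumption.

    # Hypothetical sketch: recover a periodic write "signature" from a noisy
    # per-interval throughput series. Illustrative only; not the actual
    # statistical techniques used in this work.
    import numpy as np

    def extract_bursts(throughput_gbps, noise_floor=1.0, window=5):
        """Smooth the series and return interval indices whose smoothed value
        exceeds the noise floor (candidate application write bursts)."""
        kernel = np.ones(window) / window
        smoothed = np.convolve(throughput_gbps, kernel, mode="same")
        return np.flatnonzero(smoothed > noise_floor)

    # Example: a faint periodic burst every ~20 intervals buried in noise.
    rng = np.random.default_rng(0)
    series = rng.exponential(0.3, 200)
    series[::20] += 5.0
    print(extract_bursts(series))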

Ross Miller

Hyogi Sim

This seminar series focuses on I/O and data sciences and engineering research and development activities across the DOE national laboratories, academia, and industry. A series of prominent presenters helps exchange ideas and cultivate new research and development efforts and collaborations in the field.

The list of talks is available here.

Sarp Oral

Spectral

Spectral is a transparently applied library for taking advantage of the Summit burst buffer architecture. Applications using per-process output simply write to the node-local burst buffer. Upon file close, Spectral automatically schedules the output to drain to the parallel file system; this requires no intervention from the application. The drain takes advantage of new technology in Summit nodes that allows data to move to the PFS silently while the application continues making scientific progress.

SymphonyFS

SymphonyFS is a FUSE-based client that extends local write-caching to use a node-local NVM device. SymphonyFS provides the performance benefits of writing to local NVM while giving the application the appearance of writing to the parallel file system (PFS). The FUSE daemon then drains the data to the PFS in the background. SymphonyFS is suitable for file-per-process and shared file uses.

BB API

The IBM Burst Buffer API is a C/C++-compatible library co-designed by IBM, LLNL, and ORNL for taking advantage of Summit’s burst buffer architecture. The API enables the scheduling of data movement off of the burst buffer and into the GPFS parallel file system. Utilizing NVMe-over-Fabrics, the server components of the BB API are able to transparently drain data off of the burst buffer without interfering with the running application.

BSCFS

The IBM Burst Buffer Shared Checkpoint File System (BSCFS) is an IBM, LLNL, and ORNL co-design project that seeks to close the gap for shared-file checkpointing. When utilized, a BSCFS mount point is created on the Summit compute nodes. Applications writing shared files from several nodes can use this mount point for their files; data is written through the node-local burst buffer and asynchronously persisted to the GPFS parallel file system. Taking advantage of BSCFS requires small modifications to the I/O portions of an application; however, large applications that adopt it should see significant improvements in their I/O performance.

NVMalloc

NVMalloc is a library that exposes node-local NVM as byte-addressable memory. It provides a simple, familiar API similar to malloc() and hides the complexity of using memory-mapped files. NVMalloc can extend an application’s available memory beyond system memory, especially for read-heavy data.
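To illustrate the mechanism NVMalloc hides from applications, the sketch below backs a byte-addressable region with a memory-mapped file and carves allocations out of it. The class, path, and layout are hypothetical; this is a conceptual sketch, not NVMalloc's actual C API.

    # Hypothetical sketch of the mechanism behind an NVM-backed allocator:
    # memory-map a file (in practice placed on the node-local NVMe device)
    # and hand out slices of it. Not NVMalloc's actual API.
    import mmap
    import os

    class NvmArena:
        def __init__(self, path, size):
            self.fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
            os.ftruncate(self.fd, size)
            self.buf = mmap.mmap(self.fd, size)
            self.offset = 0

        def alloc(self, nbytes):
            """Return a writable memoryview over the next nbytes of the arena."""
            start = self.offset
            self.offset += nbytes
            return memoryview(self.buf)[start:start + nbytes]

    # Example: allocate 1 MiB from a 64 MiB file-backed arena.
    arena = NvmArena("/tmp/nvm_arena.bin", 64 * 1024 * 1024)
    block = arena.alloc(1 << 20)
    block[:5] = b"hello"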

Chris Zimmer

Chris Brumgard

Scott Atchley

Sarp Oral

James Simmons

Ross Miller

Hyogi Sim

Bing Xie

The project involves analysis of the reliability characteristics of Titan’s 299,008 CPU cores and 18,688 GPUs to understand trends in machine failures, MTBF, single-bit errors, double-bit errors, off-the-bus errors, and temperature correlation with failures. The study was the first of its kind for a large-scale GPU deployment. Understanding the reliability characteristics of the system is critical to efficient system operations as well as the acquisition of future systems. Below are some key outcomes of this effort:

  1. Checkpoint advisory tool: Based on the insights gained from the detailed reliability analysis, we have devised a checkpoint advisory tool for applications. An up-to-date MTBF for a production machine can advise users on the optimal frequency for writing output or checkpoints, based on the portion of the system the job uses and the time needed to write the output. We have developed a tool to this end and are advising applications on optimal checkpoint frequency; a rough sketch of the underlying calculation appears after this list. The tool has the potential to save millions of core hours for applications.

  2. Our ongoing reliability analysis has resulted in tangible feedback to vendors on their products, which has helped to shape future roadmaps.

  3. Real-time monitoring can alert operators and management on the changing state of the system, while post-mortem analysis can point to possible causes and historical trends.

  4. Correlating machine failure logs with job logs allows us to glean insights into job productivity and into how jobs of different sizes are affected by node failures, which allows the center to deploy remedial measures.
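As a rough sketch of the kind of calculation a checkpoint advisory tool can perform, the example below estimates the optimal checkpoint interval from a job-scaled MTBF and the checkpoint write time using Young's first-order approximation. Using Young's formula, the uniform failure-spread assumption, and the sample numbers are assumptions for illustration; the actual tool's model is not shown here.

    # Hypothetical sketch of a checkpoint-interval advisory calculation using
    # Young's approximation: t_opt ~= sqrt(2 * write_time * MTBF).
    import math

    def optimal_checkpoint_interval(system_mtbf_hours, system_nodes,
                                    job_nodes, checkpoint_write_hours):
        """Estimate the optimal interval between checkpoints, in hours.

        The system MTBF is scaled to the fraction of the machine the job
        uses, assuming failures are spread uniformly across nodes.
        """
        job_mtbf = system_mtbf_hours * system_nodes / job_nodes
        return math.sqrt(2.0 * checkpoint_write_hours * job_mtbf)

    # Example (hypothetical numbers): a job on 4,000 of 18,688 nodes,
    # 30-minute checkpoint writes, and a system MTBF of 8 hours.
    print(optimal_checkpoint_interval(8.0, 18688, 4000, 0.5))  # ~6.1 hours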

Scott Atchley

Ross Miller

Resource selection can have profound impacts on the performance and reliability of applications running on a supercomputer. On Titan, there are ongoing efforts to continually improve scheduling to best meet the needs of our users.

Past Projects: Dual-ended scheduling, in effect since July 2015, reduces fragmentation on Titan by demarcating scheduling between leadership-class applications and smaller applications. Balanced ALPS ordering seeks to improve upon Cray ALPS to optimize resource ordering for leadership-class jobs.

Current Projects: GPU Select Scheduling employs both Dual-ended scheduling and ALPS reordering to enable the scheduler to improve the reliability rates of leadership applications through selective placement.
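As a conceptual sketch of the dual-ended idea, the example below allocates leadership-class jobs from the low end of an ordered free-node list and small jobs from the high end, so small jobs do not fragment the contiguous ranges large jobs need. The node-list representation and the size threshold are assumptions; the actual ALPS/scheduler implementation is not shown here.

    # Hypothetical sketch of dual-ended node selection. Thresholds and the
    # free-node-list representation are illustrative only.
    LEADERSHIP_THRESHOLD = 3750  # nodes; hypothetical cutoff for "leadership"

    def select_nodes(free_nodes, job_size):
        """Return the nodes assigned to a job from a sorted free-node list."""
        free = sorted(free_nodes)
        if job_size > len(free):
            return None  # not enough free nodes; the job must wait
        if job_size >= LEADERSHIP_THRESHOLD:
            chosen = free[:job_size]      # low end: leadership-class jobs
        else:
            chosen = free[-job_size:]     # high end: small jobs
        for node in chosen:
            free_nodes.remove(node)
        return chosen

    # Example: a 16-node job is placed at the high end of the free list.
    free_nodes = list(range(10000))
    print(select_nodes(free_nodes, 16)[:3], "...")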

Scott Atchley

Chris Zimmer

The Spider Lustre-based Parallel File System Development and Deployment: The OLCF has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, the OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, the OLCF has improved its operating procedures and strategies.

We developed a model-driven provisioning tool to assist storage system designers and administrators in reconciling key figures of merit (cost, capacity, performance, disk size, rebuild times, redundancy), answering what-if scenarios, and determining the relative importance of spare parts in minimizing data unavailability, both during initial system provisioning and during continuous operations. We validated the tool with failure data observed in the field over two years of Spider operations.
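As a rough illustration of the kind of what-if question such a model can answer, the sketch below estimates annual drive failures and rebuild exposure for a candidate configuration. The constant-AFR failure model and all numbers are assumptions for illustration, not the actual tool's model or Spider's field data.

    # Hypothetical what-if sketch for a provisioning model: expected annual
    # drive failures and rebuild hours, assuming a constant annualized
    # failure rate (AFR). Numbers are illustrative only.
    def annual_failures(num_disks, afr):
        """Expected drive failures per year for a given disk count and AFR."""
        return num_disks * afr

    def rebuild_exposure_hours(num_disks, afr, disk_tb, rebuild_tb_per_hour):
        """Expected hours per year spent rebuilding failed drives."""
        per_rebuild_hours = disk_tb / rebuild_tb_per_hour
        return annual_failures(num_disks, afr) * per_rebuild_hours

    # Example what-if: 20,000 disks, 2% AFR, 2 TB disks, 0.25 TB/hour rebuild.
    print(annual_failures(20000, 0.02))                   # ~400 failures/year
    print(rebuild_exposure_hours(20000, 0.02, 2, 0.25))   # ~3,200 rebuild-hours/year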

James Simmons

Sarp Oral

Ross Miller

Rick Mohr

Anjus George