In addition to hosting the world's fastest supercomputer, ORNL also operates the world's brightest neutron source, the Spallation Neutron Source (SNS). Funded by the US DOE Office of Basic Energy Sciences, this national user facility hosts hundreds of scientists from around the world, providing a platform to enable breakthrough research in materials science, sustainable energy, and basic science. OLCF personnel have been engaged to help manage and analyze the large data sets (ranging in size from hundreds of gigabytes to over 1 terabyte) generated by the intense pulses of neutrons.
OLCF staff and SNS data specialists collaborated to successfully complete the Accelerating Data Acquisition, Reduction, and Analysis (ADARA) Lab-Directed Research and Development project to improve the production and analysis of these data sets. OLCF provided its expertise in high-performance file systems, parallel processing, cluster configuration and management, and data management to the project. As a result of the ADARA project, a new data infrastructure was created that enables users to collect, reduce, and analyze data as they are taken; creates data files immediately after acquisition, regardless of size; reduces a data set within seconds of acquisition; and provides the resources for any user to perform post-acquisition reduction, analysis, visualization, and modeling without being on-site at the SNS facility.
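The core idea of reducing data "as it is taken" can be illustrated with a minimal sketch: accumulate streaming neutron events into a histogram as they arrive, so a reduced view exists while acquisition is still running. The event format, names, and bin counts below are illustrative assumptions, not ADARA's actual protocol or API.

```python
# Hypothetical sketch of streaming reduction: bin neutron time-of-flight
# events into a histogram as they arrive, so a reduced view of the data
# is available during acquisition. All names and formats are illustrative.

import random

NUM_BINS = 100
TOF_MAX_US = 10_000.0  # assumed time-of-flight range, in microseconds

def make_histogram():
    return [0] * NUM_BINS

def reduce_event(hist, tof_us):
    """Bin a single event by time of flight (clamped to the last bin)."""
    if 0.0 <= tof_us <= TOF_MAX_US:
        hist[min(int(tof_us / TOF_MAX_US * NUM_BINS), NUM_BINS - 1)] += 1

def stream_events(n):
    """Stand-in for a live event stream from the data acquisition system."""
    for _ in range(n):
        yield random.uniform(0.0, TOF_MAX_US)

hist = make_histogram()
for tof in stream_events(100_000):
    reduce_event(hist, tof)  # the reduced view updates per event
print(sum(hist))             # 100000: every event has been binned
```

A real pipeline would additionally persist the raw event stream alongside the reduced histogram, which is what lets ADARA serve both raw and reduced data immediately after acquisition.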
ADARA is currently running on the HYSPEC beam line, providing near real-time access to result data sets (both raw event data and reduced data) so that instrument scientists and users obtain live feedback from their experiments. Moving forward, ADARA will be deployed in production across a number of beam lines at SNS as its capabilities continue to be adopted by the facility.
More complete details on ADARA are available at http://www.csm.ornl.gov/newsite/adara.html.
OLCF Contributors: Feiyi Wang, Dale Stansberry, and Ross Miller
Constellation federates metadata from the OLCF resource fabric (stat metadata from roughly 1 billion files on the Spider PFS and HPSS, metadata from millions of jobs from the scheduler, thousands of users and groups, publications, and systems) and captures it in a custom-built in-memory graph. It builds links or associations among resources (vertices in the graph) by correlating the metadata to infer hidden relationships (e.g., linking data to jobs, or extracting keywords from publications to link together related publications, jobs, and data). Graph traversals and high-performance indexes external to the graph enable searches.
The stat metadata index is built using HBase and Spark queries on PFS stat and job metadata. We also build a hierarchical index by extracting metadata from within the datasets themselves, deriving further metadata from the base metadata. Based on this graph engine, we can discover relationships, suggest related data products and papers of interest, identify popular datasets via PageRank, and support user-specified “tags” that tie together resources for quick retrieval and sharing. [ConstellationGraph:BigData16]
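The graph-and-rank approach can be sketched in a few lines: vertices are jobs, datasets, and publications; edges come from correlated metadata (e.g., a job's output path links the job to a dataset); and a PageRank pass surfaces popular datasets. This is a toy illustration under assumed record names, not Constellation's implementation.

```python
# Toy sketch (not Constellation's code) of federating resource metadata
# into a graph and ranking datasets with a basic PageRank iteration.

from collections import defaultdict

edges = defaultdict(set)

def link(src, dst):
    """Add a directed association between two resource vertices."""
    edges[src].add(dst)

# Hypothetical correlated metadata records.
link("job:1001", "data:/proj/a/run1.h5")   # job wrote this dataset
link("job:1002", "data:/proj/a/run1.h5")
link("pub:doi-x", "data:/proj/a/run1.h5")  # publication cites the dataset
link("pub:doi-x", "job:1001")

def pagerank(edges, damping=0.85, iters=50):
    """Simplified PageRank (no dangling-node handling) over the graph."""
    nodes = set(edges) | {d for ds in edges.values() for d in ds}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, dsts in edges.items():
            share = damping * rank[src] / len(dsts)
            for d in dsts:
                new[d] += share
        rank = new
    return rank

ranks = pagerank(edges)
# Datasets with more in-links (from jobs and publications) rank higher.
popular = max((n for n in ranks if n.startswith("data:")), key=ranks.get)
print(popular)  # data:/proj/a/run1.h5
```

At OLCF scale the graph holds on the order of a billion vertices, which is why Constellation keeps high-performance indexes external to the in-memory graph rather than scanning it per query.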
The first workflow to be supported in Constellation is the acquisition of Digital Object Identifiers for scientific datasets (see "DOIs for HPC Data" below).
Contributors: Sudharshan Vazhkudai, Raghul Gunasekaran, Dale Stansberry, Tom Barron
Recent directives from federal agencies require public access to scientific data arising from taxpayer-funded research. Providing this access will require new policies and procedures, including a much-improved mechanism for dataset identification and tracking.
To this end, we are exploring the viability of digital object identifiers (DOIs) as a means to track data products emanating from scientific simulations. A DOI is a persistent identifier that can be used to track, identify, and share the data sets produced by researchers globally.
The ability to facilitate data-related services via DOIs offers new capabilities for both the HPC center and the end user. The center could utilize DOIs in its interactions with funding agencies by providing improved accounting and visibility of the user facility's data production. The center can also directly benefit from new "data strategies," such as data warehousing, that result from the improved planning information associated with DOIs and data management planning. From a user standpoint, DOIs facilitate data sharing, enable publication credit, enable data preservation beyond the lifetime of the project at the center, support lineage tracking, and facilitate the use of intermediate data products.
We are working with OSTI to create a workflow and supporting infrastructure through which users can obtain DOIs for their data products of interest.
Contributors: Sudharshan Vazhkudai, Raghul Gunasekaran, Dale Stansberry, Mitchell Griffith, Tom Barron
GUIDE is a scalable Splunk-based infrastructure that aggregates and supports analysis of huge amounts of log data, presenting a window into HPC data center operations. Logs include (i) Spider PFS: I/O bandwidth data from controllers, Lustre-level profiling data, file size distribution of the 1 billion files, and health logs from 2,016 OSTs comprising 20,160 disks, (ii) Titan CPU/GPU RAS data from 18,688 nodes, (iii) Moab job scheduler and node allocation data, (iv) interconnect congestion data from 9,600 routers, and (v) HPSS archival storage usage and file size distribution data from 61 million files. GUIDE provides higher-level services using a variety of analytics techniques, such as log correlation, data mining, and statistical and visual analytics. These can be used to identify hotspots, debug performance bottlenecks, and perform trend analysis for future resource provisioning.
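One GUIDE-style correlation can be sketched as a time-overlap join between scheduler job records and storage health events, flagging jobs that ran while an OST reported errors. The log formats and field names below are hypothetical stand-ins for what the Splunk pipeline would actually index.

```python
# Illustrative sketch of correlating two log streams by time overlap:
# scheduler job records vs. OST health events. Formats are hypothetical.

from datetime import datetime

def ts(s):
    return datetime.strptime(s, "%Y-%m-%d %H:%M")

jobs = [
    {"id": "123", "start": ts("2016-03-01 10:00"), "end": ts("2016-03-01 12:00")},
    {"id": "124", "start": ts("2016-03-01 13:00"), "end": ts("2016-03-01 14:00")},
]
ost_events = [
    {"ost": "OST0042", "time": ts("2016-03-01 11:30"), "msg": "read error"},
]

def correlate(jobs, events):
    """Return (job_id, ost) pairs where the event fell inside the job's run."""
    hits = []
    for job in jobs:
        for ev in events:
            if job["start"] <= ev["time"] <= job["end"]:
                hits.append((job["id"], ev["ost"]))
    return hits

print(correlate(jobs, ost_events))  # [('123', 'OST0042')]
```

In production this join runs as Splunk searches over indexed logs rather than nested loops, but the correlation logic (match records across sources on a shared key such as time, node, or job ID) is the same.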
Contributors: Sudharshan Vazhkudai, Ross Miller, Deryl Steinert, Chris Zimmer, Feiyi Wang
Large-scale scientific applications' usage patterns lead to I/O resource contention and load imbalance. This project implemented a dynamic shared library based on BPIO, a method to resolve such contention, which provides a transparent way to balance resource usage without source code modification or recompilation.
The BPIO Runtime Environment can be built as a shared, preloadable library. It uses the BPIO library for balanced data placement and function interposition to take precedence over the standard I/O calls. This provides end-to-end, per-job load balancing and supports a range of I/O interfaces, including POSIX and MPI-IO; HDF5 support is under development.
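The balanced-placement idea can be sketched as a greedy least-loaded assignment: each write is directed to the storage target currently carrying the lightest load, so no single target becomes a hotspot. This is a conceptual illustration, not BPIO's placement algorithm; the real library applies such decisions inside interposed I/O calls in a preloaded shared library.

```python
# Conceptual sketch of balanced data placement (not BPIO's actual
# algorithm): greedily send each write to the least-loaded target.

import heapq

def balanced_placement(write_sizes, num_targets):
    """Assign each write to the least-loaded target; return per-target loads."""
    heap = [(0, t) for t in range(num_targets)]  # (current_load, target_id)
    heapq.heapify(heap)
    assignment = []
    for size in write_sizes:
        load, target = heapq.heappop(heap)       # least-loaded target
        assignment.append(target)
        heapq.heappush(heap, (load + size, target))
    loads = [0] * num_targets
    for target, size in zip(assignment, write_sizes):
        loads[target] += size
    return assignment, loads

writes = [128, 128, 64, 64, 64, 64]  # MiB per write, hypothetical
assignment, loads = balanced_placement(writes, 4)
print(loads)  # [128, 128, 128, 128] -- perfectly even here
```

Because the placement happens transparently inside interposed calls, an application using plain POSIX or MPI-IO gets this balancing with no source changes, which is exactly the property the runtime environment provides.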
Aequilibro integrates BPIO with ADIOS, combining BPIO's interconnect-level optimization with the benefits of the ADIOS I/O framework to provide portable, fast, scalable, easy-to-use, metadata-rich output and I/O interfaces that can be changed at runtime.
TechInt contributors: Sarah Neuwirth, Feiyi Wang, Sarp Oral, Sudharshan Vazhkudai
TagIt is a novel data management service framework supporting annotation, tagging, indexing, and filtering operations on files. These services are tightly integrated into a shared-nothing GlusterFS distributed file system. TagIt manages a scalable and consistent metadata index database (MySQL) inside the file system, exploiting readily available resources. It enables advanced tagging that allows users to mark anything from entire collections of files down to specific portions of a file. It also enables the association of operators with a tag for pre-processing, filtering, or automatic metadata extraction; these operations are seamlessly offloaded to file servers in a load-aware fashion.
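The tag-index idea can be sketched with a small relational schema: tagging records file-to-tag associations in an indexed table, so retrieval by tag is a single indexed query rather than a file system crawl. This sketch uses client-side SQLite for self-containment; TagIt itself embeds a MySQL-backed index inside the GlusterFS servers, and the schema below is an assumption for illustration.

```python
# Minimal sketch of a TagIt-style tag index (illustrative schema; TagIt
# embeds its index in GlusterFS file servers with MySQL, not SQLite).

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (path TEXT PRIMARY KEY);
CREATE TABLE tags  (path TEXT, tag TEXT);
CREATE INDEX tag_idx ON tags(tag);
""")

def tag_files(paths, tag):
    """Associate a tag with a collection of files."""
    db.executemany("INSERT OR IGNORE INTO files VALUES (?)",
                   [(p,) for p in paths])
    db.executemany("INSERT INTO tags VALUES (?, ?)",
                   [(p, tag) for p in paths])

def find_by_tag(tag):
    """Retrieve every file carrying the tag: one indexed query, no crawl."""
    return [r[0] for r in
            db.execute("SELECT path FROM tags WHERE tag = ?", (tag,))]

tag_files(["/proj/run1.nc", "/proj/run2.nc"], "campaign-2016")
tag_files(["/proj/run2.nc"], "published")
print(find_by_tag("campaign-2016"))
```

Attaching an operator (e.g., a metadata extractor) to a tag then amounts to the file servers running that operator over exactly this query's result set, which is what lets TagIt offload the work in a load-aware fashion.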
TechInt contributors: Sudharshan Vazhkudai, Hyogi Sim