The HDF Group’s HPC Program

Quincey Koziol, The HDF Group

“A supercomputer is a device for turning compute-bound problems into I/O-bound problems.” – Ken Batcher, Prof. Emeritus, Kent State University.

HDF5 grew out of a collaboration between the National Center for Supercomputing Applications (NCSA) and the US Department of Energy’s Advanced Simulation and Computing Program (ASC), so high-performance computing (HPC) I/O has been a focus of ours from the very beginning.  As we begin our 20th year of developing HDF5, HPC I/O continues to be a critical driver of new features.

Los Alamos National Laboratory is home to two of the world’s most powerful supercomputers, each capable of performing more than 1,000 trillion operations per second. Here, ASC is examining the effects of a one-megaton nuclear energy source detonated on the surface of an asteroid. Image from ASC at http://www.lanl.gov/asci/

The HDF5 development team has focused on three things when serving the HPC community: performance, freedom of choice and ease of use. Our goal has always been to achieve >90% of the underlying layer’s performance (MPI-IO, in this case), so that application developers aren’t giving up significant performance when they choose to use HDF5 instead of rolling their own I/O solution. This level of performance is an ongoing challenge, as HPC systems are constantly evolving, but we work hard to meet the goal. The great variety of capabilities that HDF5 provides can be used directly by applications, or tailored to a particular user community with a targeted I/O library such as netCDF-4, CGNS, MOAB, etc.
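
For readers who have not used parallel HDF5 before, the basic pattern is small: route the library’s I/O through the MPI-IO virtual file driver and then use the normal HDF5 API from every process. Below is a minimal sketch of that pattern; the file name is a placeholder and error checking is omitted for brevity.

    /* Minimal parallel HDF5 sketch: every MPI process opens the same file
     * through the MPI-IO virtual file driver (error checking omitted). */
    #include <mpi.h>
    #include "hdf5.h"

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* Tell HDF5 to perform its file I/O through the MPI-IO layer. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* All processes create/open the file collectively. */
        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... create datasets and perform collective or independent I/O ... */

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }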

Beyond writing code to improve the HDF5 library, we are actively working with many US Department of Energy (DOE) and National Science Foundation (NSF) supercomputing centers to reach out to HPC application developers and help them use HDF5 more effectively in their code.  As part of this effort, we also keep our eyes open for performance and usability needs in the community, which in turn drive new features in HDF5, creating a virtuous cycle of benefits that spreads throughout the HDF5 user community.

In addition to these efforts that benefit today’s applications, the HDF Group is involved in efforts targeted to future needs of the HPC community.  In particular, we are partners with Intel in the Exascale System Storage and I/O (ESSIO) project funded by the DOE.  This project is focused on creating a storage stack that targets the needs of future HPC systems, with greatly improved fault tolerance, asynchronous I/O and support for a deep memory hierarchy that includes volatile and non-volatile memory as well as disk and tape.

This long-term dedication to supporting the needs of our user community continues with the upcoming HDF5 1.10 release. We are planning to incorporate many improvements to parallel HDF5 in this release:

  • Speed up file closing by avoiding file truncation: The HDF5 file format through the 1.8 releases requires that an HDF5 file be explicitly truncated when it is closed. However, truncating a file on a parallel file system can be an exceptionally slow operation. In response, we have changed the HDF5 file format to allow the file’s valid length to be directly encoded when the file is closed, avoiding a painful file truncation operation.
  • Faster I/O with collective metadata read operations: Metadata read operations in parallel HDF5 are independent I/O operations, occurring on each process of the MPI application using HDF5. However, when all processes perform the same operation, such as opening a dataset, this can cause a “read storm” on the parallel file system, with each process generating identical I/O operations. We have addressed this problem by adding a collective I/O property to metadata read operations such as opening objects and iterating over links. This allows the library to use a single process to read metadata from the HDF5 file and broadcast that information to the other processes, drastically reducing the number of I/O operations that the file system encounters.
  • Collective metadata writes: The HDF5 library currently writes out modified file metadata from all processes in the application, with many independent I/O operations. However, because modifying file metadata is already a collective operation, it is possible to write all the file metadata at once with a single collective I/O, improving the performance of metadata writes considerably with no application changes required. A sketch of enabling collective metadata reads and writes together appears after this list.
  • Improved performance through multi-dataset I/O: The HDF5 library will allow a single dataset access operation to act on multiple datasets. These new API routines, H5Dread_multi() and H5Dwrite_multi(), perform one I/O operation across multiple datasets in the file. They can improve performance, especially when the data accessed across several datasets from all processes can be aggregated in the MPI-IO layer. The new functions can be used for both independent and collective access, although they are primarily aimed at collective I/O; see the sketch after this list.
  • Page-buffered I/O: Currently, the HDF5 library performs I/O on individual metadata objects, generating many small I/O operations at random file locations and causing poor performance, particularly on today’s parallel file systems such as GPFS or Lustre. With page-buffered I/O enabled, application developers can specify a page size and alignment for the library’s I/O, so that many metadata entries are written out in a single, page-aligned I/O operation that is friendly to the parallel file system, speeding up I/O considerably. This feature can also improve performance for serial applications. A sketch of enabling page buffering appears after this list.
  • Cache image writing: The HDF5 library maintains a cache of file metadata entries, holding frequently accessed entries in memory for as long as possible in order to minimize I/O to the file. While a file remains open, the cache greatly helps I/O performance, but when the file is closed, the contents of the metadata cache are released and modified entries are written out to their locations in the file, generating a burst of small I/O operations (although these may be mitigated by the other features above). When the file is re-opened, all the hot items in the cache must be reloaded, generating more I/O operations. To address this, an image of all the metadata entries in the cache can be written out in a single I/O operation and then reloaded when the file is reopened, eliminating many I/O operations and improving performance greatly. Like page-buffered I/O, this feature can improve serial as well as parallel application performance; see the sketch after this list.
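
The sketches below show how we expect several of these features to surface in the C API; they assume a parallel build of HDF5 1.10, omit error checking, and use property and routine names as currently planned, so details may shift before release. First, collective metadata reads and writes are both file access properties:

    /* Sketch: enabling collective metadata reads and writes (1.10, as planned). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Perform metadata reads (object opens, link iteration, ...) collectively:
     * one process reads from the file and broadcasts to the others. */
    H5Pset_all_coll_metadata_ops(fapl, 1);

    /* Flush all modified file metadata with a single collective write
     * instead of many independent writes. */
    H5Pset_coll_metadata_write(fapl, 1);

    hid_t file = H5Fopen("example.h5", H5F_ACC_RDWR, fapl);
    /* ... */
    H5Fclose(file);
    H5Pclose(fapl);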
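
Next, multi-dataset I/O. The exact signature of H5Dread_multi()/H5Dwrite_multi() is still being finalized; this sketch assumes the routines take parallel arrays of dataset, memory type, and dataspace identifiers plus an array of buffers, and the identifiers dset_a, dset_b, the dataspaces, and the buffers are hypothetical names standing in for objects the application has already created.

    /* Sketch: writing two datasets with one call (signature assumed; see text). */
    hid_t       dsets[2]   = {dset_a, dset_b};                   /* open datasets     */
    hid_t       mtypes[2]  = {H5T_NATIVE_INT, H5T_NATIVE_DOUBLE};
    hid_t       mspaces[2] = {mspace_a, mspace_b};                /* memory dataspaces */
    hid_t       fspaces[2] = {fspace_a, fspace_b};                /* file selections   */
    const void *bufs[2]    = {buf_a, buf_b};

    /* A collective transfer property list lets the MPI-IO layer aggregate
     * the requests coming from all processes. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite_multi(2, dsets, mtypes, mspaces, fspaces, dxpl, bufs);
    H5Pclose(dxpl);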
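
Page-buffered I/O involves two pieces: a file creation property that lays the file out in fixed-size pages, and a file access property that buffers those pages in memory. The property names, page size, and buffer size below reflect the configuration we currently plan for 1.10 and are shown only as an example.

    /* Sketch: page-aligned file layout plus an in-memory page buffer. */
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, 0, (hsize_t)1);
    H5Pset_file_space_page_size(fcpl, 4096);           /* example page size */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* Total page buffer size, then minimum percentages reserved for
     * metadata pages and raw data pages, respectively. */
    H5Pset_page_buffer_size(fapl, 1024 * 1024, 20, 0);

    hid_t file = H5Fcreate("paged.h5", H5F_ACC_TRUNC, fcpl, fapl);
    /* ... */
    H5Fclose(file);
    H5Pclose(fapl);
    H5Pclose(fcpl);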
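
Finally, cache image writing is requested through the metadata cache image configuration on the file access property list. The structure fields and constants below reflect the 1.10 API as we currently expect it to appear in the public headers, so treat them as illustrative rather than final.

    /* Sketch: ask the library to write a metadata cache image at file close. */
    H5AC_cache_image_config_t img_cfg;
    img_cfg.version            = H5AC__CURR_CACHE_IMAGE_CONFIG_VERSION;
    img_cfg.generate_image     = 1;   /* write the cache contents on close */
    img_cfg.save_resize_status = 0;
    img_cfg.entry_ageout       = H5AC__CACHE_IMAGE__ENTRY_AGEOUT__NONE;

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_mdc_image_config(fapl, &img_cfg);

    hid_t file = H5Fopen("example.h5", H5F_ACC_RDWR, fapl);
    /* ... on H5Fclose(), the cache image is written out and will be
     *     reloaded the next time the file is opened ... */
    H5Fclose(file);
    H5Pclose(fapl);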

Together with other improvements in the HDF5 1.10 release, we are committed to taking parallel HDF5 and our user community into the exascale realm and beyond!
