UPDATE January 19, 2016: The HDF5-1.10.0-alpha1 release is now available, adding Collective Metadata I/O to these features:
– Concurrent Access to an HDF5 File: Single Writer / Multiple Reader (SWMR)
– Virtual Dataset (VDS)
– Scalable Chunk Indexing
– Persistent Free File Space Tracking
We’re pleased to announce the release of HDF5 1.10.0-alpha0.
HDF5 1.10.0, planned for release in Spring 2016, is a major release containing many new features. On January 6, 2016, we announced the release of the first alpha version of the software.
The alpha0 release contains some, but not all, of the features that will be in HDF5 1.10.0. The Single Writer/Multiple Reader and Virtual Dataset features are both contained in this alpha release, as are scalable chunk indexing and persistent free file space tracking. More features, such as enhancements to parallel HDF5 and support for compressing contiguous datasets, will be added in upcoming alpha releases.
In my previous blog post, I discussed the need for parallel I/O and a few paradigms for doing parallel I/O from applications. HDF5 is an I/O middleware library that supports (or will support in the near future) most of the I/O paradigms we talked about.
In this blog post I will discuss how to use HDF5 to implement some of the parallel I/O methods and some of the ongoing research to support new I/O paradigms. I will not discuss pros and cons of each method since we discussed those in the previous blog post.
But before getting on with how HDF5 supports parallel I/O, let’s address a question that comes up often:
“Why do I need Parallel HDF5 when the MPI standard already provides an interface for doing I/O?”
“Any software used in the computational sciences needs to excel in the area of high performance computing (HPC).”
The Computational Fluid Dynamics (CFD) General Notation System (CGNS) is an effort to standardize CFD input and output data, including grid (both structured and unstructured), flow solution, connectivity, boundary conditions, and auxiliary information. It provides a general, portable, and extensible standard for the storage and retrieval of CFD analysis data. The system consists of two parts: (1) a standard format for recording the data, and (2) software that reads, writes, and modifies data in that format.
UPDATE Wednesday, March 23, 2016: The HDF5-1.10.0-pre2 release is now available, featuring:
– Concurrent Access to an HDF5 File: Single Writer / Multiple Reader (SWMR)
– Virtual Dataset (VDS)
– Scalable Chunk Indexing
– Persistent Free Filespace Tracking
– Collective Metadata I/O
– Integration of Java HDF5 JNI into HDF5
– Many changes have been made to the HDF5 configuration
– Unfortunately, parallel HDF5 enhancement has been postponed
This version contains a fix for an issue which occurred when building HDF5 within the source code directory.
The HDF Group is committed to meeting our users’ needs and expectations for managing data in today’s fast evolving computational environment. We are pleased to report that the upcoming major new release of HDF5 (HDF5 1.10.0) will have new capabilities that address important data challenges faced by our community. In this blog we introduce you to some of these exciting new features and capabilities.
More powerful than ever before and packed with new features, the release is scheduled for March 2016. Among many enhancements, HDF5 1.10.0 addresses:
If you have encountered challenges in any of these areas, then we are certain that the upcoming HDF5 1.10.0 will be of interest to you.
What costs applications time and resources that could otherwise go to actual computation? Slow I/O. It is well known that I/O subsystems are very slow compared to other parts of a computing system. Applications use I/O to store simulation output for future use by analysis applications, to checkpoint application memory to guard against system failure, to exercise out-of-core techniques for data that does not fit in a processor’s memory, and so on. I/O middleware libraries, such as HDF5, provide application users with a rich interface for I/O access to organize their data and store it efficiently. Such I/O libraries invest a lot of effort in reducing or completely hiding the cost of I/O from applications.
Parallel I/O is one technique used to access data on disk simultaneously from different application processes to maximize bandwidth and speed things up. There are several ways to do parallel I/O, and I will highlight the most popular methods that are in use today.
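To make the idea concrete, here is a minimal sketch of several workers writing to one shared file at the same time, each at its own non-overlapping offset. It is not HDF5 or MPI-IO code: it uses only the Python standard library, threads stand in for the application processes, and the names `BLOCK_SIZE`, `write_block`, and `parallel_write` are hypothetical choices made for this illustration. It assumes a POSIX system, since `os.pwrite` is not available on Windows.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # bytes each worker writes (hypothetical value for the sketch)


def write_block(path, rank):
    """Write this worker's block at its own offset in the shared file."""
    data = bytes([rank]) * BLOCK_SIZE
    fd = os.open(path, os.O_WRONLY)
    try:
        # pwrite takes an explicit offset, so workers never share a file cursor.
        return os.pwrite(fd, data, rank * BLOCK_SIZE)
    finally:
        os.close(fd)


def parallel_write(path, num_workers):
    """Pre-size the file, then let every worker write its block concurrently."""
    with open(path, "wb") as f:
        f.truncate(num_workers * BLOCK_SIZE)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        written = pool.map(lambda rank: write_block(path, rank), range(num_workers))
        return sum(written)


if __name__ == "__main__":
    shared = os.path.join(tempfile.mkdtemp(), "shared.bin")
    total = parallel_write(shared, 4)
    with open(shared, "rb") as f:
        print(total, f.read())
```

Because each worker owns a disjoint byte range, no locking is needed between writers. Whether the simultaneous writes actually improve bandwidth depends on the underlying file system.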
First, to leverage parallel I/O, it is very important that you have a parallel file system; Continue reading →
Perhaps the original producers of “big data,” the oil & gas (O&G) industry held its eighth annual High-Performance Computing (HPC) workshop in early March. Hosted by Rice University, the workshop brings in attendees from both the HPC and petroleum industries. Jan Odegard, the workshop organizer, invited me to the workshop to give a tutorial and short update on HDF5.
The workshop (#oghpc) has grown a great deal during the last few years and now has more than 500 people attending, with preliminary attendance numbers for this year’s workshop over 575 people (even in a “down” year for the industry). In fact, Jan’s pushing it to a “conference” next year, saying, “any workshop with more attendees than Congress is really a conference.” But it’s still a small enough crowd and venue that most people know each other well, both on the Oil & Gas and HPC sides.
The workshop program had two main tracks, one on HPC-oriented technologies that support the industry, and one on oil & gas technologies and how they can leverage HPC. The HPC track is interesting, but mostly “practical” and not research-oriented, unlike, for example, the SC technical track. The oil & gas track seems more research-focused, in ways that can enable the industry to be more productive.
The first version of HDF was implemented the following spring. Over the next 10 years HDF enjoyed widespread interest and adoption for managing scientific and engineering data. The NASA Earth Observing System (EOS) was an early adopter of HDF. NASA provided much of the funding and technical requirements that made HDF a robust technology, able to support mission-critical applications.
By 1996 it became clear that HDF was not going to adequately address the demands of the next generation of data volumes and computing systems, and in 1998 a second version, called HDF5, was implemented. HDF5 was more scalable than the original HDF (now called HDF4), and had many other improvements. The Department of Energy’s Sandia, Los Alamos, and Lawrence Livermore National Laboratories provided the core funding, technical requirements, and many of the people that made the new format possible. HDF5 quickly replaced HDF4 in popularity, and spread even more rapidly.
In the late 1990s and early 2000s the HDF Group faced increasing demands to ensure that HDF was robust, that HDF5 kept up with advancing technologies and data demands, and that we offer high quality professional support for HDF users. It soon became clear that the HDF Group could best serve these demands by striking out on its own, as an entity separate from the University and NCSA, who had nurtured us so well for 18 years. In January 2005, The HDF Group was incorporated as a not-for-profit company. In July 2006, twelve of us set up shop in the University of Illinois Research Park, and we got ourselves a logo:
Our initial funding came from a financial company that had adopted HDF5 to help gather and manage multiple high speed, high volume market data feeds. We provided them with support and a number of new capabilities in HDF5. The NASA EOS soon joined with contracts for the new company, as did two of the three DOE Labs. The HDF Group chose to be a non-profit because we had a public mission, and we wanted to feel confident that the company would not be diverted from that mission for reasons of financial gain.
The HDF Group’s mission is:
To provide high quality software for managing large complex data, to provide outstanding services for users of these technologies, and to insure effective management of data throughout the data life cycle.
The mission has two goals:
1. To create, maintain, and evolve software and services that enable society to manage large complex data at every stage of the data life cycle.
2. To establish and maintain a sustainable organization with a highly-skilled and committed team devoted to accomplishing the first goal.
The rest is details. We’ll be getting into those details in future blog posts, and we’re hoping some of you will contribute.
Meanwhile, send your comments and questions. We’d love to hear from you. Subscribe to our blog posts on the sidebar. And if you’d like to do a post, please send an email to firstname.lastname@example.org.