HDF5 as a zero-configuration, ad-hoc scientific database for Python

Andrew Collette, Research Scientist with IMPACT, HDF Guest Blogger

“…HDF5 is that rare product which excels in two fields: archiving and sharing data according to strict standardized conventions, and also ad-hoc, highly flexible and iterative use for local data analysis. For more information on using Python together with HDF5…”

IMPACT breaks the 100 km/s speed barrier in February 2015. The Dust Accelerator Lab detected its fastest dust grain to date: an iron grain with a charge of 0.2 fC and a diameter of 30 nm was clocked at a speed of 107.6 km/s (or 240,694 mph). Image used with permission from IMPACT.

An enormous amount of effort has gone into the HDF ecosystem over the past decade. Because of a concerted effort between The HDF Group, standards bodies, and analysis software vendors, HDF5 is one of the best technologies on the planet for sharing numerical data. Not only is the format itself platform-independent, but nearly every analysis platform in common use can read HDF5. This investment continues with tools like HDF Product Designer and the REST-based H5Serv project, for sharing data using the HDF5 object model over the Internet.

What I’d like to talk about today is something very different: the way that I and many others in the Python world use HDF5, not for widely-shared data but for data that may never even leave the local disk…

HDF at the 2015 Oil & Gas High Performance Computing Workshop

Quincey Koziol, The HDF Group

photo from NASA.gov

Perhaps the original producers of “big data,” the oil & gas (O&G) industry held its eighth annual High-Performance Computing (HPC) workshop in early March. Hosted by Rice University, the workshop brings in attendees from both the HPC and petroleum industries. Jan Odegard, the workshop organizer, invited me to the workshop to give a tutorial and short update on HDF5.

Rice University hosts 2015 O & G HPC Workshop

The workshop (#oghpc) has grown a great deal during the last few years and now draws more than 500 attendees; preliminary numbers for this year’s workshop exceed 575 (even in a “down” year for the industry). In fact, Jan’s pushing it to a “conference” next year, saying, “any workshop with more attendees than Congress is really a conference.” But it’s still a small enough crowd and venue that most people know each other well, both on the Oil & Gas and HPC sides.

The workshop program had two main tracks, one on HPC-oriented technologies that support the industry, and one on oil & gas technologies and how they can leverage HPC.  The HPC track is interesting, but mostly “practical” and not research-oriented, unlike, for example, the SC technical track. The oil & gas track seems more research-focused, in ways that can enable the industry to be more productive.

I gave an hour-and-a-half tutorial on developing and tuning parallel HDF5 applications, which…

From HDF5 Datasets to Apache Spark RDDs

Gerd Heber, The HDF Group

“I would like to do something with all the datasets in all the HDF5 files in this directory, but I’m not sure how to proceed.”

If this sounds all too familiar, then reading this article might be worth your while. The accepted general answer is to write a Python script (and use h5py [1]), but I am not going to repeat here what you know already. Instead, I will show you how to hot-wire one of the new shiny engines, Apache Spark [2], and make a few suggestions on how to reduce the coding on your part while opening the door to new opportunities.
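
For readers who have not written such a script, here is a minimal sketch of the plain h5py approach mentioned above. It is only an illustration, not code from this article: the function name `list_datasets` and the choice to match files by the `.h5` and `.hdf5` extensions are assumptions made for the example.

```python
import os

import h5py


def list_datasets(directory):
    """Return (file path, dataset name, shape, dtype) for every dataset found.

    Hypothetical helper: only files ending in .h5 or .hdf5 directly inside
    `directory` are opened (an assumption for this sketch).
    """
    found = []
    for entry in sorted(os.listdir(directory)):
        if not entry.endswith((".h5", ".hdf5")):
            continue
        path = os.path.join(directory, entry)
        with h5py.File(path, "r") as f:
            # visititems calls the function for each group and dataset
            # reachable through hard links below the root group.
            def visitor(name, obj):
                if isinstance(obj, h5py.Dataset):
                    found.append((path, obj.name, obj.shape, str(obj.dtype)))

            f.visititems(visitor)
    return found
```

Calling `list_datasets(".")` would return one tuple per dataset, which can then be filtered or processed one at a time. Everything here runs sequentially in a single process, which is the limitation the rest of this article tries to get around.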

But what about Hadoop? There is no out-of-the-box interoperability between HDF5 and Hadoop. See our BigHDF FAQs [3] for a few glimmers of hope. Major points of contention remain, such as HDFS’s “blocked” worldview and its aversion to relatively small objects, and then there is HDF5’s determination to keep its smarts away from prying eyes. Spark is more relaxed and works happily with HDFS, Amazon S3, and, yes, a local file system or NFS. More importantly, with its Resilient Distributed Datasets (RDDs) [4] it raises the level of abstraction and overcomes several Hadoop/MapReduce shortcomings when dealing with iterative methods. See reference [5] for an in-depth discussion.

Figure 1. A simple HDF5/Spark scenario

As our model problem (see Figure 1), consider the following scenario: …

Welcome to our blog

Welcome to the new HDF Group Blog.

We are excited to introduce a blog series to share knowledge about HDF.  The blog will include information about HDF technologies, uses of HDF, plans for HDF, our company and its mission, and anything else that might be of interest to HDF users and others who could enjoy the benefits of HDF.

Our staff will post regularly on the blog. We also welcome guest blogs from the community.  If you’d like to do a post, please send an email to blog@hdfgroup.org.

We hope you will comment on blog posts and on the comments of others. Comments are moderated. We will review them and post them as quickly as possible.

The HDF blog does not replace our usual modes of communicating. We will continue to rely on the HDF website, the HDF forum, the HDF helpdesk, newsletters, bulletins, and Twitter.

Welcome, again, to the HDF Group Blog.  Let this be the beginning of a lively and informative dialogue.

Mike Folk
The HDF Group

We’d love to hear from you.  What do you want us to write about?  Let us know by commenting!

The HDF Group – who we are

We thought it would be good to kick off the HDF Blog series with a short explanation of who we are and why we exist.

The HDF Group started in 1987 at the National Center for Supercomputing Applications (NCSA) at the University of Illinois in Urbana, Illinois. Here’s an email from the first meeting of the group:

Minutes from the first HDF Group meeting

The first version of HDF was implemented the following spring. Over the next 10 years HDF enjoyed widespread interest and adoption for managing scientific and engineering data. The NASA Earth Observing System (EOS) was an early adopter of HDF. NASA provided much of the funding and technical requirements that made HDF a robust technology, able to support mission-critical applications.

By 1996 it became clear that HDF was not going to adequately address the demands of the next generation of data volumes and computing systems, and in 1998 a second version, called HDF5, was implemented. HDF5 was more scalable than the original HDF (now called HDF4), and had many other improvements. The Department of Energy’s Sandia, Los Alamos, and Lawrence Livermore National Laboratories provided the core funding, technical requirements, and many of the people that made the new format possible. HDF5 quickly replaced HDF4 in popularity, and spread even more rapidly.

In the late 1990s and early 2000s the HDF Group faced increasing demands to ensure that HDF was robust, that HDF5 kept up with advancing technologies and data demands, and that we offered high-quality professional support for HDF users. It soon became clear that the HDF Group could best serve these demands by striking out on its own, as an entity separate from the University and NCSA, which had nurtured us so well for 18 years.

In January 2005, The HDF Group was incorporated as a not-for-profit company. In July 2006, twelve of us set up shop in the University of Illinois Research Park, and we got ourselves a logo:

Our logo

Our initial funding came from a financial company that had adopted HDF5 to help gather and manage multiple high-speed, high-volume market data feeds. We provided them with support and a number of new capabilities in HDF5. The NASA EOS soon joined with contracts for the new company, as did two of the three DOE Labs.

The HDF Group chose to be a non-profit because we had a public mission, and we wanted to feel confident that the company would not be diverted from that mission for reasons of financial gain.

The HDF Group’s mission is:

To provide high quality software for managing large complex data, to provide outstanding services for users of these technologies, and to ensure effective management of data throughout the data life cycle.

The mission has two goals:

1. To create, maintain, and evolve software and services that enable society to manage large complex data at every stage of the data life cycle.
2. To establish and maintain a sustainable organization with a highly-skilled and committed team devoted to accomplishing the first goal.

The rest is details. We’ll be getting into those details in future blog posts, and we’re hoping some of you will contribute.

Meanwhile, send your comments and questions. We’d love to hear from you.  Subscribe to our blog posts on the sidebar.  And if you’d like to do a post, please send an email to blog@hdfgroup.org.

Mike Folk