HDF5 Data Compression Demystified #1

Elena Pourmal, The HDF Group

What happened to my compression?

One of the most powerful features of HDF5 is the ability to compress or otherwise modify, or “filter,” your data during I/O.

By far, the most common user-defined filters are ones that perform data compression.  As you know, there are many compression options.

  • There are filters provided by the HDF5 library (“predefined filters”), which include several types of filters for data compression, data shuffling, and checksums.
  • Users can implement their own “user-defined filters” and employ them with the HDF5 library.
Cars in a 1973 Philadelphia junkyard – image from National Archives and Records Administration
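
To make the "data shuffling" filter mentioned in the list above a little more concrete, here is a minimal sketch of the idea in plain Python. It does not call the HDF5 library; the `shuffle` and `unshuffle` helpers are hypothetical names written for this illustration, and `zlib` stands in for a DEFLATE-based compression filter. The point is only that regrouping the bytes of fixed-size elements (all first bytes together, then all second bytes, and so on) can make slowly varying numeric data easier to compress.

```python
import struct
import zlib


def shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(data[i::itemsize] for i in range(itemsize))


def unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle(): restore the original element layout."""
    n_items = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n_items:(i + 1) * n_items]
    return bytes(out)


# Illustrative data: 10,000 consecutive 32-bit little-endian integers.
values = list(range(1000, 11000))
raw = struct.pack(f"<{len(values)}i", *values)

plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffle(raw, 4)))

print("raw bytes:           ", len(raw))
print("compressed:          ", plain_size)
print("shuffled + compressed:", shuffled_size)
```

Shuffling is lossless: `unshuffle(shuffle(raw, 4), 4)` returns the original bytes, which is why this kind of filter is typically paired with a compression filter rather than used on its own.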

While the programming model and usage of the compression filters are straightforward, it is possible for a new user to overlook important details and end up with data in HDF5 that fails to compress.  To wit, we’ve received questions at our Helpdesk such as, “I used the GZIP compression filter in my application, but the dataset didn’t get appreciably smaller.  It seems like it didn’t compress my data.”

First, an unchanged storage size or a low compression ratio doesn’t necessarily mean that compression “didn’t compress my data.” But this certainly suggests that something might have gone wrong when a compression filter was applied. How can you find out what happened?

The problem generally falls into one of two categories:

1. The compression filter was not applied.
2. The compression filter was applied but was not effective.

The second result can occur in the rare instances when data is not compressible using the filter chosen.
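
The second category is easy to reproduce outside HDF5. The sketch below uses Python's standard `zlib` module, which implements the same DEFLATE algorithm that underlies GZIP, as a stand-in for the compression filter; `compression_ratio` is a hypothetical helper written for this illustration, and the two sample buffers are made up. Highly repetitive data shrinks dramatically, while random bytes do not shrink at all, even though the compressor ran in both cases.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, level))


# Illustrative inputs, each 1,000,000 bytes long.
repetitive = b"\x00\x01\x02\x03" * 250_000   # a short pattern repeated
random_like = os.urandom(1_000_000)          # effectively incompressible

print(f"repetitive data ratio:  {compression_ratio(repetitive):.1f}")
print(f"random-like data ratio: {compression_ratio(random_like):.3f}")
```

A ratio close to 1 therefore does not prove that the filter was skipped; it may simply mean that the data has little redundancy for the chosen algorithm to exploit.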

The first result can happen when the compression filter is not available at run time, or if HDF5 can’t find it.  It is this result that this blog post focuses on.  I’ll present a few troubleshooting techniques in case you happen to encounter a compression issue.

I am afraid at this point many of you are taking a deep breath to prepare for a cold shower of HDF5 technical details.  Fortunately, this is not the case.

Putting some Spark into HDF-EOS

Gerd Heber and Joe Lee, The HDF Group

In an earlier blog post [3], we merely floated the idea of bulk-processing HDF5 files with Apache Spark. In this article, we follow up with a few simple use cases and some numbers for a data collection to which many readers will be able to relate.

If the first question on your mind is, “What kind of resources will I need?”, then you have a valid point, but you also might be the victim of BigData propaganda. Consider this: “Most people don’t realize how much number crunching they can do on a single computer.”

Aura: “A mission dedicated to the health of the earth’s atmosphere” using HDF technologies.  EOS Satellite Image courtesy of Jesse Allen, NASA Earth Observatory/SSAI

“If you don’t have big data problems, you don’t need MapReduce and Hadoop. It’s great to know they exist and to know what you could do if you had big-data problems.” ([5], p. 323)  In this article, we focus on how far we can push our personal computing devices with Spark, and leave the discussion of Big Iron and Big Data vs. big data vs. big data, etc. for another day.

Parallel I/O – Why, How, and Where to?

Mohamad Chaarawi, The HDF Group

First in a series: parallel HDF5

What costs applications a lot of time and resources without contributing to the actual computation?  Slow I/O.  It is well known that I/O subsystems are very slow compared to other parts of a computing system.  Applications use I/O to store simulation output for future use by analysis applications, to checkpoint application memory to guard against system failure, to exercise out-of-core techniques for data that does not fit in a processor’s memory, and so on.  I/O middleware libraries, such as HDF5, provide application users with a rich interface for I/O access to organize their data and store it efficiently.  A lot of effort is invested by such I/O libraries to reduce or completely hide the cost of I/O from applications.

Parallel I/O is one technique used to access data on disk simultaneously from different application processes to maximize bandwidth and speed things up. There are several ways to do parallel I/O, and I will highlight the most popular methods that are in use today.  
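
As a rough, single-machine illustration of the basic idea (several workers writing disjoint regions of one shared file at the same time), consider the sketch below. It is not parallel HDF5 and does not use MPI: threads stand in for application processes, `os.pwrite` (POSIX only) stands in for the underlying parallel I/O layer, and the names `BLOCK`, `write_block`, and `parallel_write` are hypothetical, chosen for this example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024  # bytes owned by each worker (illustrative value)


def write_block(fd: int, rank: int) -> int:
    """Write this worker's block at its own offset; return bytes written."""
    payload = bytes([rank]) * BLOCK
    # pwrite takes an explicit offset, so workers do not share a file position.
    return os.pwrite(fd, payload, rank * BLOCK)


def parallel_write(path: str, n_workers: int) -> int:
    """Have n_workers write disjoint regions of one file concurrently."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            written = list(pool.map(lambda r: write_block(fd, r), range(n_workers)))
    finally:
        os.close(fd)
    return sum(written)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.bin")
    total = parallel_write(path, n_workers=4)
    with open(path, "rb") as f:
        contents = f.read()

print("bytes written:", total, "file size:", len(contents))
```

Because each worker owns a non-overlapping byte range, no locking is needed between them. On a real cluster, the same pattern only pays off when the file system underneath can actually serve those regions concurrently.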

Blue Waters supercomputer at the National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign campus.  Blue Waters is supported by the National Science Foundation and the University of Illinois.

First, to leverage parallel I/O, it is very important that you have a parallel file system.

HDF5 for the Web – HDF Server

John Readey, The HDF Group

HDF5 is a great way to store large data collections, but size can pose its own challenges.  As a thought experiment, imagine this scenario:

You write an application that creates the ultimate Monte Carlo simulation of the Monopoly game. The application plays through thousands of simulated games for a hundred different strategies and saves its results to an HDF5 file. Given that we want to capture all the data from each simulation, let’s suppose the resultant HDF5 file is over a gigabyte in size.

Naturally, you’d like to share these results with all your Monopoly-playing, statistically-minded friends, but herein lies the problem: How can you make this data accessible?  Your file is too large to put on Dropbox, and even if you did use an online storage provider, interested parties would need to download the entire file when perhaps they are only interested in the results for “Strategy #89: Buy just Park Place and Boardwalk.”  If we could store the data in one place, but enable access to it over the web using all the typical HDF5 operations (listing links, getting type information, reading dataset slices, etc.), that would be the answer to our conundrum.