HDF5 Data Compression Demystified #1

Elena Pourmal, The HDF Group

What happened to my compression?

One of the most powerful features of HDF5 is the ability to compress or otherwise modify, or “filter,” your data during I/O.

By far, the most commonly used filters are the ones that perform data compression.  As you know, there are many compression options:

  • There are filters provided by the HDF5 library (“predefined filters”), which include several types of filters for data compression, data shuffling, and checksums.
  • Users can also implement their own “user-defined filters” and employ them with the HDF5 library.

Cars in a 1973 Philadelphia junkyard – image from National Archives and Records Administration

While the programming model and usage of the compression filters are straightforward, it is possible for a new user to overlook important details and end up with data in HDF5 that fails to compress.  To wit, we’ve received questions at our Helpdesk such as, “I used the GZIP compression filter in my application, but the dataset didn’t get appreciably smaller.  It seems like it didn’t compress my data.”

First, an unchanged storage size or a low compression ratio doesn’t necessarily mean that compression “didn’t compress my data.” But this certainly suggests that something might have gone wrong when a compression filter was applied. How can you find out what happened?

The problem generally falls into one of two categories:

1. The compression filter was not applied.
2. The compression filter was applied but was not effective.

The second result can occur in the rare instances when data is not compressible using the filter chosen.

The first result can happen when the compression filter is not available at run time or HDF5 cannot find it.  It is this result that this blog focuses on.  I’ll present a few troubleshooting techniques in case you happen to encounter a compression issue.

I am afraid at this point many of you are taking a deep breath to prepare for a cold shower of HDF5 technical details.  Fortunately, that is not necessary.  However, there are some basics you should understand about creating and writing compressed datasets, and you should have an idea of how HDF5’s filter pipeline works.  Other than that, no special HDF5 knowledge is required to troubleshoot this problem.
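As a refresher, here is a minimal sketch in C (error checking omitted; the file and dataset names are just illustrative) of creating and writing a chunked, GZIP-compressed dataset.  The dimensions, chunk size, and compression level loosely mirror the h5dump output shown later in this post:

#include "hdf5.h"

int main(void)
{
    hsize_t dims[2]  = {32, 64};            /* dataset dimensions         */
    hsize_t chunk[2] = {5, 9};              /* chunk dimensions           */
    int     data[32][64];
    for (int i = 0; i < 32; i++)            /* fill with some sample data */
        for (int j = 0; j < 64; j++)
            data[i][j] = i * j;

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    /* Compression requires chunked storage; both the chunking and the
       compression filter are set on the dataset creation property list. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 9);                /* GZIP (deflate), level 9    */

    hid_t dset = H5Dcreate(file, "DS1", H5T_STD_I32LE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

Keep this example in mind; the rest of the post looks at what happens when the deflate filter requested here is not actually available.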

A quick review can help you determine to which category the problem belongs. If the filter wasn’t applied at all, there are only two reasons:  it was not included at compile time when the library was built, or it was not found at run time for dynamically loaded filters.

But, wait a second… How might this happen? Wouldn’t the application fail? And the answer is… it depends.  Here is one important design feature you need to know to understand the HDF5 compression filter behavior.

When a filter is added to the I/O pipeline, the library is given an instruction as to what the H5Dwrite call should do if the filter fails – skip the filter and continue with I/O, or fail. The filter failure usually happens when the filter is not available or the size of the filtered data is actually bigger than the size of the original data.

For user-defined filters, an application can control this behavior through the flags argument of the H5Pset_filter function. However, an HDF5 application itself cannot control the behavior of the HDF5 predefined filters.  Only two of them, the Fletcher32 checksum and SZIP compression, will cause H5Dwrite to fail; the absence of any other filter will be happily ignored by the H5Dwrite call.
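As an illustration, the snippet below is a sketch (not taken from a real application); filter_id stands in for a user-defined filter that has already been registered with H5Zregister, and it shows how the flags argument of H5Pset_filter selects between the two behaviors:

/* Attach a user-defined filter to a dataset creation property list. */
static herr_t add_user_filter(hid_t dcpl, H5Z_filter_t filter_id, int mandatory)
{
    unsigned int cd_values[1] = {0};   /* filter-specific parameters, if any */

    /* H5Z_FLAG_OPTIONAL: if the filter is missing or fails, H5Dwrite skips
       it and writes the data unfiltered.  H5Z_FLAG_MANDATORY: if the filter
       is missing or fails, H5Dwrite fails instead.                         */
    unsigned int flags = mandatory ? H5Z_FLAG_MANDATORY : H5Z_FLAG_OPTIONAL;

    return H5Pset_filter(dcpl, filter_id, flags, 1, cd_values);
}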

Now you can see why an HDF5 application that uses an HDF5 installation built without the compression libraries will succeed, but will produce uncompressed HDF5 datasets.

You’d be surprised how often we encounter HDF5 installations where the ZLIB library was not configured in.  A simple typo on the configure line or a wrong path to the compression library can cause this problem.

Fortunately, there are three ways to confirm the absence or presence of the HDF5 predefined filters. One is by examining the HDF5 installation, specifically the libhdf5.settings file that can be found under the lib directory of the installed HDF5.
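The exact wording in libhdf5.settings varies from one HDF5 version to another, but you are looking for an entry that lists the configured I/O filters, roughly along these lines; if ZLIB was not configured in, deflate will be missing from the list:

I/O filters (external): deflate(zlib),szip(encoder)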

A second way is to call H5Zfilter_avail in your application and report missing compression filters at run time.
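For example, a small check like the one below (a sketch; the warning messages are mine) can be run at startup to report missing predefined compression filters:

#include <stdio.h>
#include "hdf5.h"

/* Warn at run time if a predefined compression filter is not
   available in the HDF5 library the application is linked against. */
static void check_compression_filters(void)
{
    if (H5Zfilter_avail(H5Z_FILTER_DEFLATE) <= 0)
        fprintf(stderr, "Warning: GZIP (deflate) filter is not available.\n");
    if (H5Zfilter_avail(H5Z_FILTER_SZIP) <= 0)
        fprintf(stderr, "Warning: SZIP filter is not available.\n");
}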

A third way to discover whether compression was applied is to run the HDF5 command-line utility h5dump with the -p and -H flags on the HDF5 files produced, and then examine the compression ratio.  The h5dump output shows which filters and compression were expected to be used on a dataset, and how effective the compression was, as the example below shows:

$ hdf5/bin/h5dump -p -H *.h5
HDF5 "h5ex_d_gzip.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 32, 64 ) / ( 32, 64 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 5, 9 )
         SIZE 5018 (1.633:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 9 }
      }

The output shows that ZLIB (deflate) compression was applied with a compression ratio of 1.633:1. The compression ratio is defined as the ratio of the original size to the storage size. In this example, the stored data occupies 5018 bytes vs. 8192 bytes of uncompressed data. Clearly, the compression was successfully applied.  If you see a compression ratio less than or equal to 1:1, you can be sure that something went wrong.

A compression ratio of less than 1:1 means that the data stored in the dataset is bigger than the original data. This happens when the data was not compressed and some chunks have “ghost zones,” i.e., edge chunks that extend beyond the dataset’s dimensions but are still stored at full chunk size.

When the compression ratio is 1:1, you will need to investigate further using tools like h5ls and h5debug.  When examining the metadata of the individual chunks, you may find that:

  • Compression (or some other filter) was not applied at all (this is usually the case when compression was not found by the HDF5 library) or,
  • It was applied but failed to compress the data.

There are many compression filters out there. It is impossible to say a priori which compression will produce the best results for your data. You can use the h5repack utility to experiment with different compression methods and find a combination of compression and other HDF5 filters that works best for your data.
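For example, a command along these lines (the file names are placeholders) rewrites an existing file with the shuffle filter followed by GZIP level 6, so you can compare the resulting compression ratios with h5dump -p -H:

$ hdf5/bin/h5repack -f SHUF -f GZIP=6 original.h5 repacked.h5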

Hopefully, now you will be able to easily achieve compression.

The next Helpdesk question that we get is, “Why is reading or writing to my compressed dataset slow?”  This will be the topic of the next blog, “HDF5 Data Compression Demystified #2”; stay tuned!

For more information on compression troubleshooting, check our technical notes – https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-CompressionTroubleshooting.pdf
