HDF: The Next 30 Years (Part 1)

Dave Pearah, The HDF Group

How can users of open source technology ensure that the open source solutions they depend on every day don’t just survive, but thrive?

While on my flight home from New York, I’m reflecting on The Trading Show, which focused on tech solutions for the small but influential world of proprietary and quantitative financial trading. I participated in a panel called “Sharing is Caring,” regarding the industry’s broad use of open source technology.

The panel featured a mix of companies that both provide and use open source software. Among the topics:

  • Are cost pressures the only driving force behind the open source movement among trading firms, hedge funds and banks?
  • How will open source solutions shape the future of quant and algorithmic trading?
  • And of particular interest: How can we create an environment that encourages firms with proprietary technology to contribute back to open source projects?

The issue of open source sustainability received vigorous discussion. Many people make the mistake of assuming that open source software packages just “take care of themselves” or “will always be around,” but the evidence suggests otherwise.

The recent Ford Foundation report, “Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure” paints a very dim picture of poorly maintained or abandoned open source projects that are used by large communities of users (i.e. the issue is support, not adoption).

Klint Finley’s recent Wired article, “Open Source Won. So, Now What?” says, “Despite this mainstream success, many crucial open source projects—projects that major companies rely on—are woefully underfunded. And many haven’t quite found the egalitarian ideal that can really sustain them in the long term.”

This topic is near and dear to me as the CEO of The HDF Group. HDF has a long history of integrity and is very committed to its user community – survival is mandatory. At the same time, I bear the responsibility to ensure that The HDF Group’s technologies – a vitally important and broadly adopted technology portfolio – not only survive, but thrive.

In order to achieve this, we have to grow the HDF business. Why?

The HDF Group is a not-for-profit organization that makes money through consulting, typically in two forms:

  1. Adding functionality to the HDF Group’s software portfolio (i.e., HDF5, HDF4, HDFView, etc.)
  2. Helping people be successful with HDF (review, tune, correct, coach, train, educate, advise, etc.)

The profit from these activities funds the sustainability and evolution of HDF5. This is actually a fairly common open source business model and one that has worked well for us for nearly 30 years. So why change? Costs are increasing because the user base is growing: more user support, more testing, more configurations, etc. There is no corresponding increase in revenue to offset these costs.

We have an amazingly talented + passionate + dedicated team of folks who focus entirely on the HDF library. With either the sweat equity or financial equity of the user community, we can not only continue these efforts, but make plans to address new features and functions that benefit the entire user community.

In my next post – Part 2 – I’ll outline some of the ideas around how we plan to engage the HDF community and create a conversation around how to ensure HDF’s viability and relevance for the next 30 years. I’ll also outline a number of ways that you and your organizations can partner with us to make this a win-win for all stakeholders.

I look forward to this dialogue with you and I’m eager to see your blog comments!

Dave Pearah


…HDF5 has broad adoption in the financial services industry, including high-frequency trading (HFT) firms, hedge funds, investment banks, pension boards and data syndicators. All sizes and types of financial firms rely on massive amounts of data for trading, risk analysis, customer portfolio analysis, historical market research, and many other data-intensive functions.

Many HDF adopters in finance have extremely large and complex datasets with very fast access requirements. Others turn to HDF because it allows them to easily share data across a wide variety of computational platforms using applications written in different programming languages. Some use HDF to take advantage of the many HDF-friendly tools used in financial analysis and modeling, such as MATLAB, Pandas, PyTables and R.

HDF technologies are relevant when the data challenges being faced push the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. Leveraging the powerful HDF products and the expertise of The HDF Group, organizations realize substantial cost savings while solving challenges that seemed intractable using other data management technologies. For more information, please visit our website product pages.



HDF5 and The Big Science of Nuclear Stockpile Stewardship

The August 2016 issue of Physics Today includes a fascinating piece titled, “The Big Science of stockpile stewardship.”1

The article leads with, “In the quarter century since the US last exploded a nuclear weapon, an extensive research enterprise has maintained the resources and know-how needed to preserve confidence in the country’s stockpile.”  It goes on to give the history of how the US Department of Energy (DOE) and its Los Alamos, Sandia and Lawrence Livermore national laboratories pioneered the use of high-performance computing to use computer simulation as a replacement for the actual building and testing of the USA’s nuclear weapons stockpile.

Although HDF5 is not named in this article, the histories of The HDF Group and HDF5 are closely linked to this larger story of American science and geopolitics.  In 1993, DOE determined that its computing capabilities would require massive improvements, as the article says, to “ramp up computation speeds by a factor of 10,000 over the highest performing computers at the time, equivalent to a factor of 1 million over computers routinely used for nuclear calculations… To meet the [ten-year] goal, the DOE laboratories had to engage the computer industry in massively parallel processing, a technology that was just becoming available, to develop not just new hardware but new software and visualization techniques.”

HDFql – the new HDF tool that speaks SQL

Rick, HDFql team, HDF guest blogger

HDFql (Hierarchical Data Format query language) was recently released to enable users to handle HDF5 files with a language as easy and powerful as SQL. 

By providing a simpler, cleaner, and faster interface for HDF across C/C++/Java/Python/C#, HDFql aims to ease scientific computing, big data management, and real-time analytics. As the author of HDFql, Rick is collaborating with The HDF Group by integrating HDFql with tools such as HDF Compass, while continuously improving HDFql to meet user needs.

Introducing HDFql

If you’re handling HDF files on a regular basis, chances are you’ve had your (un)fair share of programming headaches. Sure, you might have gotten used to the hassle, but navigating the current APIs probably feels a tad like filing expense reports: rarely a complete pleasure!

If you’re new to HDF, you might seek to avoid the format altogether. Even trained users have been known to occasionally scout for alternatives.  One doesn’t have to have a limited tolerance for unnecessary complexity to get queasy around these APIs – one simply needs a penchant for clean and simple data management.

This is what we heard from scientists and data veterans when asked about HDF. It’s what challenged our own synapses and inspired us to create HDFql. Because on the flip-side, we also heard something else:

  • HDF has proven immensely valuable in research and science
  • the data format pushes the boundaries on what is achievable with large and complex datasets
  • and it provides an edge on speed and fast access which is critical in the big data / advanced analytics arena

With an aspiration of becoming the de facto language for HDF, we hope that HDFql will play a vital role in the future of HDF data management by:

  • Enabling current users to arrive at (scientific) insights faster via cleaner data handling experiences
  • Inspiring prospective users to adopt the powerful data format HDF by removing current roadblocks
  • Perhaps even grabbing a few HDF challengers or dissenters along the way…


Easy access to the NASA HDF products via OPeNDAP’s Hyrax

MuQun (Kent) Yang, The HDF Group

Many NASA HDF and HDF5 data products can be visualized via the Hyrax OPeNDAP server through Hyrax’s HDF4 and HDF5 handlers.  Now we’ve enhanced the HDF5 OPeNDAP handler so that SMAP level 1, level 3 and level 4 products can be displayed properly using popular visualization tools.

Organizations in both the public and private sectors use HDF to meet long term, mission-critical data management needs. For example, NASA’s Earth Observing System, the primary data repository for understanding global climate change, uses HDF.  Over the lifetime of the project, which began in 1999, NASA has stored 15 petabytes of satellite data in HDF which will be accessible by NASA data centers and NASA HDF end users for many years to come.

In a previous blog, we discussed the concept of using the Hyrax OPeNDAP web server to serve NASA HDF4 and HDF5 products.  Each year, The HDF Group has enhanced the HDF4 and HDF5 handlers that work within the Hyrax OPeNDAP framework to support all sorts of NASA HDF data products, making them interoperable with popular Earth Science tools such as NASA’s Panoply and UCAR’s IDV. The Hyrax HDF4 and HDF5 handlers make data products display properly using popular visualization tools.

Announcing HDF5 1.10.0

We are excited and pleased to announce HDF5-1.10.0, the most powerful version of our flagship software ever.

HDF5 1.10.0 is now available

This major new release of HDF5 is more powerful than ever before and packed with new capabilities that address important data challenges faced by our user community.

HDF5 1.10.0 contains many important new features and changes, including those listed below. The features marked with * use new extensions to the HDF5 file format.

  • The Single-Writer / Multiple-Reader or SWMR feature enables users to read data while concurrently writing it. *
  • The virtual dataset (VDS) feature enables users to access data in a collection of HDF5 files as a single HDF5 dataset and to use the HDF5 APIs to work with that dataset. *   (NOTE: There is a known issue with the h5repack utility when using it to modify the layout of a VDS. We understand the issue and are working on a patch for it.)
  • New indexing structures for chunked datasets were added to support SWMR and to optimize performance. *
  • Persistent free file space can now be managed and tracked for better performance. *
  • The HDF5 Collective Metadata I/O feature has been added to improve performance when reading and writing data collectively with Parallel HDF5.
  • The Java HDF5 JNI has been integrated into HDF5.
  • Changes were made in how autotools handles large file support.
  • New options for the storage and filtering of partial edge chunks have been added for performance tuning. *

* Files created with these new extensions will not be readable by applications based on the HDF5-1.8 library.

We would like to thank you, our user community, for your support, and your input and feedback which helped shape this important release.

The HDF Group

Solutions to Data Challenges

Please refer to the following document which describes the new features in this release:   https://www.hdfgroup.org/HDF5/docNewFeatures/

All new and modified APIs are listed in detail in the “HDF5 Software Changes from Release to Release” document:     https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes.html

For detailed information regarding this release see the release notes:     https://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.0/src/hdf5-1.10.0-RELEASE.txt

For questions regarding these or other HDF issues, contact:      help@hdfgroup.org

Links to the HDF5 1.10.0 source code, documentation, and additional materials can be found on the HDF5 web page at:     https://www.hdfgroup.org/HDF5/

The HDF5 1.10.0 release can be obtained directly from:   https://www.hdfgroup.org/HDF5/release/obtain5110.html

User documentation for 1.10.0 can be accessed from:   https://www.hdfgroup.org/HDF5/doc/

The Blosc meta-compressor

Francesc Alted, Freelance Consultant, HDF guest blogger

The HDF Group has a long history of collaboration with Francesc Alted, creator of PyTables.  Francesc was one of the first HDF5 application developers who successfully employed external compression methods in an HDF5 application (PyTables). The first two compression methods that were registered with The HDF Group were LZO and BZIP2 implemented in PyTables; when Blosc was added to PyTables, it became a winner.

While HDF5 and PyTables address data organization and I/O needs for many applications, solutions like the Blosc meta-compressor presented in this blog are simpler and achieve great I/O performance. They are alternatives to HDF5 in cases when portability and data organization are not critical, but compression is still desired.  Enjoy the read!

Why compression?

Compression is a hot topic in data handling. The largest database players have recently (or not-so-recently) implemented support for different kinds of compression libraries. Why is that? It’s all about efficiency: modern CPUs are so fast in comparison with storage write speeds that compression not only offers the opportunity to store more with less space, but also to improve storage bandwidth (figure: compression read speed).
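To make the bandwidth argument concrete, here is a back-of-the-envelope sketch in Python. Every number in it (disk throughput, compression ratio, decompression speed) is a hypothetical assumption chosen for illustration, not a measurement, and the model pessimistically assumes that I/O and decompression do not overlap.

```python
def effective_read_seconds(raw_bytes, disk_bytes_per_s,
                           ratio=1.0, decompress_bytes_per_s=None):
    """Estimate the time needed to get raw_bytes of usable data into memory.

    ratio is raw size / stored size (1.0 means uncompressed).
    decompress_bytes_per_s is the codec's output rate; None means no codec.
    """
    # Time spent pulling the (possibly smaller) stored bytes off the disk.
    io_time = (raw_bytes / ratio) / disk_bytes_per_s
    # Time spent on the CPU turning stored bytes back into raw bytes.
    cpu_time = 0.0 if decompress_bytes_per_s is None else raw_bytes / decompress_bytes_per_s
    return io_time + cpu_time


GB = 10**9
raw = 10 * GB            # hypothetical dataset size
disk = 200 * 10**6       # hypothetical disk: 200 MB/s

plain = effective_read_seconds(raw, disk)
packed = effective_read_seconds(raw, disk, ratio=4.0,
                                decompress_bytes_per_s=2 * GB)

print(f"uncompressed: {plain:.1f} s, compressed: {packed:.1f} s")
```

Under these assumed figures the compressed read finishes sooner, because the CPU time spent decompressing is smaller than the disk time saved. With a slow codec or a fast disk the balance can tip the other way.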

The HDF5 library is an excellent example of a data container that supported out-of-the-box compression in the very first release of HDF5 in November 1998. Its innovation was to introduce support for compression of chunked datasets in a way that permitted the developer to apply compression to each of the chunks individually, resulting in reasonably fast and transparent compression using different codecs. HDF5 also introduced pluggable compression filters that allowed external developers to implement support for different codecs for HDF5. Then, with release 1.8.11, the library added the ability to discover, load and register filters at run time. More recently, in release 1.8.15 (and fully documented in 1.8.16), HDF5 gained a new Plugin Interface that provides complete programmatic control of dynamically loaded plugins. HDF5’s filter features now offer much-desired flexibility, giving users the freedom to choose the codec that best suits their needs.
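The idea of compressing each chunk individually can be sketched in a few lines of plain Python. This is only an illustration of the concept: it uses the standard library's zlib as a stand-in codec, the helper names write_chunked and read_range are made up for this example, and it does not reproduce HDF5's file format or its filter pipeline.

```python
import zlib


def write_chunked(data: bytes, chunk_size: int, level: int = 6):
    """Split data into fixed-size chunks and compress each one independently."""
    return [zlib.compress(data[i:i + chunk_size], level)
            for i in range(0, len(data), chunk_size)]


def read_range(chunks, chunk_size: int, start: int, stop: int):
    """Return (data[start:stop], number of chunks decompressed).

    Only the chunks overlapping the requested byte range are decompressed.
    """
    if stop <= start:
        return b"", 0
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    touched = chunks[first:last + 1]
    block = b"".join(zlib.decompress(c) for c in touched)
    offset = start - first * chunk_size
    return block[offset:offset + (stop - start)], len(touched)


data = bytes(range(256)) * 4096           # 1 MiB of sample data
chunk_size = 64 * 1024
chunks = write_chunked(data, chunk_size)  # 16 independently compressed chunks

piece, n_touched = read_range(chunks, chunk_size, 100_000, 100_100)
print(len(chunks), n_touched, piece == data[100_000:100_100])
```

Because every chunk is a self-contained compressed unit, a small read touches only the chunks it overlaps rather than the whole dataset, and in principle each chunk could be handled by a different codec.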

Why Blosc?

In the last decade the trend has been to implement faster codecs at the expense of reduced compression ratios. The idea is to reduce compression/decompression time overhead…

HDF5 and .NET: One step back, two steps forward

Gerd Heber, The HDF Group and Haymo Kutschbach,* ILNumerics

Metaphorically speaking, this blog post is about a frog trying to climb out of a well, a damp and unsightly corner of the HDF5 ecosystem called HDF5.NET. People who know more about its genesis tell us that it was never intended as what it came to be perceived as, an “aspirational” .NET interface for HDF5 that would one day be complete and fully supported. Be that as it may, it’s important to ask, “What can we do today to better serve the needs of the .NET community?” We believe, as the title suggests, we need to take a step back to move forward.

To Serve and Protect: Web Security for HDF5

John Readey, The HDF Group

HDF Server is a new product from The HDF Group which enables HDF5 resources to be accessed and modified using Hypertext Transfer Protocol (HTTP).

HDF Server [1], released in February 2015, was first developed as a proof of concept that enabled remote access to HDF5 content using a RESTful API.  HDF Server version 0.1.0 wasn’t yet intended for use in a production environment since it didn’t initially provide a set of security features and controls.  Following its successful debut, The HDF Group incorporated additional planned features.  The newest version of HDF Server provides exciting capabilities for accessing HDF5 data in an easy and secure way.