HDF: The Next 30 Years (Part 1)

Dave Pearah, The HDF Group

How can users of open source technology ensure that the open source solutions they depend on every day don’t just survive, but thrive?

While on my flight home from New York, I’m reflecting on The Trading Show, which focused on tech solutions for the small but influential world of proprietary and quantitative financial trading. I participated in a panel called “Sharing is Caring,” regarding the industry’s broad use of open source technology.

The panel featured a mix of companies that both provide and use open source software. Among the topics:

  • Are cost pressures the only driving force behind the open source movement among trading firms, hedge funds and banks?
  • How will open source solutions shape the future of quant and algorithmic trading?
    And of particular interest,
  • How can we create an environment that encourages firms with proprietary technology to contribute back to open source projects?

The issue of open source sustainability received vigorous discussion. Many people make the mistake of assuming that open source software packages just “take care of themselves” or “will always be around,” but the evidence suggests otherwise.

The recent Ford Foundation report, “Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure,” paints a very dim picture of poorly maintained or abandoned open source projects that are used by large communities of users (i.e., the issue is support, not adoption).

Klint Finley’s recent Wired article, Open Source Won. So, Now What? says, “Despite this mainstream success, many crucial open source projects—projects that major companies rely on—are woefully underfunded. And many haven’t quite found the egalitarian ideal that can really sustain them in the long term.”

This topic is near and dear to me as the CEO of The HDF Group. HDF has a long history of integrity and is very committed to its user community – survival is mandatory. At the same time, I bear the responsibility to ensure that The HDF Group’s technologies – a vitally important and broadly adopted technology portfolio – not only survive, but thrive.

In order to achieve this, we have to grow the HDF business. Why?

The HDF Group is a not-for-profit organization that makes money through consulting, typically in two forms:

  1. Adding functionality to The HDF Group’s software portfolio (e.g., HDF5, HDF4, HDFView)
  2. Helping people be successful with HDF (review, tune, correct, coach, train, educate, advise, etc.)

The profit from these activities funds the sustainability and evolution of HDF5. This is actually a fairly common open source business model and one that has worked well for us for nearly 30 years. So why change? Costs are increasing because the user base is growing: more user support, more testing, more configurations, etc. There is no corresponding increase in revenue to offset these costs.

We have an amazingly talented + passionate + dedicated team of folks who focus entirely on the HDF library. With either the sweat equity or financial equity of the user community, we can not only continue these efforts, but make plans to address new features and functions that benefit the entire user community.

In my next post – Part 2 – I’ll outline some of the ideas around how we plan to engage the HDF community and create a conversation around how to ensure HDF’s viability and relevance for the next 30 years. I’ll also outline a number of ways that you and your organizations can partner with us to make this a win-win for all stakeholders.

I look forward to this dialogue with you and I’m eager to see your blog comments!

Dave Pearah

 

…HDF5 has broad adoption in the financial services industry, including high-frequency trading (HFT) firms, hedge funds, investment banks, pension boards and data syndicators. All sizes and types of financial firms rely on massive amounts of data for trading, risk analysis, customer portfolio analysis, historical market research, and many other data-intensive functions.

Many HDF adopters in finance have extremely large and complex datasets with very fast access requirements. Others turn to HDF because it allows them to easily share data across a wide variety of computational platforms using applications written in different programming languages. Some use HDF to take advantage of the many HDF-friendly tools used in financial analysis and modeling, such as MATLAB, Pandas, PyTables and R.

HDF technologies are relevant when the data challenges being faced push the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. Leveraging the powerful HDF products and the expertise of The HDF Group, organizations realize substantial cost savings while solving challenges that seemed intractable using other data management technologies. For more information, please visit our website product pages.

 

 

The Blosc meta-compressor

Francesc Alted, Freelance Consultant, HDF guest blogger

The HDF Group has a long history of collaboration with Francesc Alted, creator of PyTables. Francesc was one of the first HDF5 application developers to successfully employ external compression filters in an HDF5 application (PyTables). The first two compression methods registered with The HDF Group were LZO and BZIP2, both implemented in PyTables; when Blosc was added to PyTables, it became a winner.

While HDF5 and PyTables address data organization and I/O needs for many applications, solutions like the Blosc meta-compressor presented in this blog post are simpler, achieve great I/O performance, and are alternatives to HDF5 in cases where portability and data organization are not critical but compression is still desired. Enjoy the read!

Why compression?

Compression is a hot topic in data handling. The largest database players have recently (or not-so-recently) implemented support for different kinds of compression libraries. Why is that? It’s all about efficiency: modern CPUs are so fast in comparison with storage write speeds that compression not only offers the opportunity to store more data in less space, but also to improve storage bandwidth.

[Figure: compression read speed]
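To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple model in which the compressed bytes are read first and then decompressed, with no overlap between the two steps. The function names and all the numbers are hypothetical and chosen only for illustration; they are not measurements from this post.

```python
def plain_read_seconds(size_mb, disk_mb_per_s):
    """Time to read the data uncompressed."""
    return size_mb / disk_mb_per_s


def compressed_read_seconds(size_mb, disk_mb_per_s, ratio, decompress_mb_per_s):
    """Time to read the compressed bytes, then decompress them.

    ratio is the compression ratio (uncompressed size / compressed size).
    decompress_mb_per_s is measured on the uncompressed output.
    """
    read_time = (size_mb / ratio) / disk_mb_per_s
    decompress_time = size_mb / decompress_mb_per_s
    return read_time + decompress_time


# Hypothetical figures: 1000 MB of data, a 100 MB/s disk,
# a 2x compression ratio and a codec that decompresses at 1000 MB/s.
plain = plain_read_seconds(1000, 100)
packed = compressed_read_seconds(1000, 100, 2, 1000)
print(f"uncompressed read: {plain:.1f} s, compressed read: {packed:.1f} s")
```

Under this model, compression pays off whenever the codec decompresses faster than the disk bandwidth multiplied by ratio / (ratio - 1), which is why fast codecs matter more as the gap between CPU and storage speed grows.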

The HDF5 library is an excellent example of a data container that supported out-of-the-box compression in the very first release of HDF5 in November 1998. Its innovation was to introduce support for compression of chunked datasets in a way that permitted the developer to apply compression to each of the chunks individually, resulting in reasonably fast and transparent compression using different codecs. HDF5 also introduced pluggable compression filters that allowed external developers to implement support for different codecs for HDF5. Then, with release 1.8.11, The HDF Group added the ability to discover, load and register filters at run time. More recently, in release 1.8.15 (and fully documented in 1.8.16), HDF5 gained a new Plugin Interface that provides complete programmatic control of dynamically loaded plugins. HDF5’s filter features now offer much-desired flexibility, giving users the freedom to choose the codec that best suits their needs.
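The idea of compressing each chunk individually can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the HDF5 API or its on-disk layout: the helper names compress_chunks and read_element are hypothetical, and zlib from the standard library stands in for whichever codec you would actually choose.

```python
import zlib
from array import array


def compress_chunks(values, chunk_len, level=6):
    """Split values into fixed-size chunks and compress each one independently."""
    chunks = []
    for start in range(0, len(values), chunk_len):
        raw = array("d", values[start:start + chunk_len]).tobytes()
        chunks.append(zlib.compress(raw, level))
    return chunks


def read_element(chunks, chunk_len, index):
    """Return one element, decompressing only the chunk that contains it."""
    chunk_no, offset = divmod(index, chunk_len)
    block = array("d")
    block.frombytes(zlib.decompress(chunks[chunk_no]))
    return block[offset]


# Illustrative data: 1000 doubles stored as ten chunks of 100 values each.
data = [float(i % 10) for i in range(1000)]
chunks = compress_chunks(data, chunk_len=100)
stored = sum(len(c) for c in chunks)
print(f"{len(chunks)} chunks, {stored} bytes stored vs {len(data) * 8} raw")
print("element 257:", read_element(chunks, 100, 257))
```

Because every chunk is a self-contained compressed block, reading a single value costs one chunk decompression rather than a pass over the whole dataset, and swapping the codec only means replacing the two zlib calls.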

Why Blosc?

In the last decade the trend has been to implement faster codecs at the expense of reduced compression ratios. The idea is to reduce compression/decompression time overhead …

Letter to the HDF User Community

Lindsay Powers – The HDF Group

The HDF Group provides free, open-source software that is widely used in government, academia and industry. The goal of The HDF Group is to ensure the sustainable development of HDF (Hierarchical Data Format) technologies and the ongoing accessibility of HDF-stored data because users and organizations have mission-critical systems and archives relying on these technologies. These users and organizations are a critical element of the HDF community and an important source of new and innovative uses of, and sustainability for, the HDF platforms, libraries and tools.

We want to create a sustainability model for the open access platforms and libraries that can serve these diverse communities in the future use and preservation of their data. As a step towards engaging this community, we are seeking partners for a National Science Foundation Research Coordination Network (RCN).

The National Science Foundation supports RCNs in order to foster collaboration and communication among scientists and technologists in the areas of research coordination, education and training, collaborative technologies, and standards development. Our vision of this RCN is to develop a core community of experienced and dedicated HDF users to:

  1. Foster education and training of new and existing users through development of teaching modules, workshops and other mechanisms for sharing knowledge and experience,
  2. Provide a forum for sharing tools and techniques related to HDF technologies,
  3. Convene diverse users to foster interdisciplinary collaboration, and
  4. Formalize a community of committed HDF users invested in the sustainability of HDF products.
