HDF: The Next 30 Years (Part 1)

Dave Pearah, The HDF Group

How can users of open source technology ensure that the open source solutions they depend on every day don’t just survive, but thrive?

While on my flight home from New York, I’m reflecting on The Trading Show, which focused on tech solutions for the small but influential world of proprietary and quantitative financial trading. I participated in a panel called “Sharing is Caring,” regarding the industry’s broad use of open source technology.

The panel featured a mix of companies that both provide and use open source software. Among the topics:

  • Are cost pressures the only driving force behind the open source movement among trading firms, hedge funds and banks?
  • How will open source solutions shape the future of quant and algorithmic trading?
  • And, of particular interest: How can we create an environment that encourages firms with proprietary technology to contribute back to open source projects?

The issue of open source sustainability received vigorous discussion. Many people make the mistake of assuming that open source software packages just “take care of themselves” or “will always be around,” but the evidence suggests otherwise.

The recent Ford Foundation report, “Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure,” paints a bleak picture of poorly maintained or abandoned open source projects that are nonetheless used by large communities of users (i.e., the issue is support, not adoption).

Klint Finley’s recent Wired article, “Open Source Won. So, Now What?”, says: “Despite this mainstream success, many crucial open source projects—projects that major companies rely on—are woefully underfunded. And many haven’t quite found the egalitarian ideal that can really sustain them in the long term.”

This topic is near and dear to me as the CEO of The HDF Group. HDF has a long history of integrity and is very committed to its user community – survival is mandatory. At the same time, I bear the responsibility to ensure that The HDF Group’s technologies – a vitally important and broadly adopted technology portfolio – not only survive, but thrive.

In order to achieve this, we have to grow the HDF business. Why?

The HDF Group is a not-for-profit organization that makes money through consulting, typically in two forms:

  1. Adding functionality to The HDF Group’s software portfolio (e.g., HDF5, HDF4, HDFView)
  2. Helping people be successful with HDF (review, tune, correct, coach, train, educate, advise, etc.)

The profit from these activities funds the sustainability and evolution of HDF5. This is actually a fairly common open source business model and one that has worked well for us for nearly 30 years. So why change? Costs are increasing because the user base is growing: more user support, more testing, more configurations, etc. There is no corresponding increase in revenue to offset these costs.

We have an amazingly talented + passionate + dedicated team of folks who focus entirely on the HDF library. With either the sweat equity or financial equity of the user community, we can not only continue these efforts, but make plans to address new features and functions that benefit the entire user community.

In my next post – Part 2 – I’ll outline some of the ideas around how we plan to engage the HDF community and create a conversation around how to ensure HDF’s viability and relevance for the next 30 years. I’ll also outline a number of ways that you and your organizations can partner with us to make this a win-win for all stakeholders.

I look forward to this dialogue with you and I’m eager to see your blog comments!

Dave Pearah

 

…HDF5 has broad adoption in the financial services industry, including high-frequency trading (HFT) firms, hedge funds, investment banks, pension boards and data syndicators. Financial firms of all sizes and types rely on massive amounts of data for trading, risk analysis, customer portfolio analysis, historical market research, and many other data-intensive functions.

Many HDF adopters in finance have extremely large and complex datasets with very fast access requirements. Others turn to HDF because it allows them to easily share data across a wide variety of computational platforms using applications written in different programming languages. Some use HDF to take advantage of the many HDF-friendly tools used in financial analysis and modeling, such as MATLAB, Pandas, PyTables and R.

HDF technologies are relevant when the data challenges being faced push the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. Leveraging the powerful HDF products and the expertise of The HDF Group, organizations realize substantial cost savings while solving challenges that seemed intractable using other data management technologies. For more information, please visit our website product pages.

 

 

The HDF Group welcomes new CEO Dave Pearah

Pearah joins The HDF Group as new Chief Executive Officer

Champaign, IL —  The HDF Group today announced that its Board of Directors has appointed David Pearah as its new Chief Executive Officer. The HDF Group is a software company dedicated to creating high performance computing technology to address many of today’s Big Data challenges.

Pearah replaces Mike Folk upon his retirement after ten years as company President and Board Chair. Folk will remain a member of the Board of Directors, and Pearah will become the company’s Chairman of the Board of Directors.

Pearah said, “I am honored to have been selected as The HDF Group’s next CEO. It is a privilege to be part of an organization with a nearly 30-year history of delivering innovative technology to meet the Big Data demands of commercial industry, scientific research and governmental clients.”

Industry leaders in fields from aerospace and biomedicine to finance are among the company’s clients. In addition, government entities such as the Department of Energy and NASA, numerous research facilities, and scientists in disciplines from climate study to astrophysics depend on HDF technologies.

Pearah continued, “We are an organization led by a mission to make a positive impact on everyone we engage, whether they are individuals using our open-source software, or organizations who rely on our talented team of scientists and engineers as trusted partners. I will do my best to serve the HDF community by enabling our team to fulfill their passion to make a difference.  We’ve just delivered a major release of HDF5 with many additional powerful features, and we’re very excited about several innovative new products that we’ll soon be making available to our user community.”

“Dave is clearly the leader for HDF’s future, and …”

Announcing HDF5 1.10.0

We are excited and pleased to announce HDF5-1.10.0, the most powerful version of our flagship software ever.

HDF5 1.10.0 is now available

This major new release of HDF5 is more powerful than ever before and packed with new capabilities that address important data challenges faced by our user community.

HDF5 1.10.0 contains many important new features and changes, including those listed below. The features marked with * use new extensions to the HDF5 file format.

  •  The Single-Writer / Multiple-Reader (SWMR) feature enables users to read data while it is concurrently being written. * (A minimal writer sketch appears after this list.)
  • The virtual dataset (VDS) feature enables users to access data in a collection of HDF5 files as a single HDF5 dataset and to use the HDF5 APIs to work with that dataset. *   (NOTE: There is a known issue with the h5repack utility when using it to modify the layout of a VDS. We understand the issue and are working on a patch for it.)
  • New indexing structures for chunked datasets were added to support SWMR and to optimize performance. *
  • Persistent free file space can now be managed and tracked for better performance. *
  • The HDF5 Collective Metadata I/O feature has been added to improve performance when reading and writing data collectively with Parallel HDF5.
  • The Java HDF5 JNI has been integrated into HDF5.
  • Changes were made in how autotools handles large file support.
  • New options for the storage and filtering of partial edge chunks have been added for performance tuning. *

* Files created with these new extensions will not be readable by applications based on the HDF5-1.8 library.
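
To make the SWMR feature concrete, here is a minimal writer sketch using the HDF5 C API. It is only an illustration: it assumes HDF5 1.10.0 or later, the file and dataset names are invented for the example, and error checking is omitted.

    /* Minimal SWMR writer sketch (assumes HDF5 >= 1.10.0); names are
     * illustrative and error checking is omitted for brevity. */
    #include "hdf5.h"

    int main(void)
    {
        /* SWMR requires the latest file-format features */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
        hid_t file = H5Fcreate("swmr_demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Create an extendible, chunked 1-D dataset before enabling SWMR */
        hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
        hid_t space = H5Screate_simple(1, dims, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);
        hid_t dset  = H5Dcreate2(file, "ticks", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Switch the file into SWMR-write mode */
        H5Fstart_swmr_write(file);

        /* ... extend the dataset, write new records, and call H5Dflush()
         * periodically so readers see the appended data ... */

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
        H5Fclose(file); H5Pclose(fapl);
        return 0;
    }

A reader process can then open the same file with H5F_ACC_RDONLY | H5F_ACC_SWMR_READ and poll the dataset’s dimensions while the writer appends.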

We would like to thank you, our user community, for your support and for the input and feedback that helped shape this important release.

The HDF Group

Solutions to Data Challenges

Please refer to the following document which describes the new features in this release:   https://www.hdfgroup.org/HDF5/docNewFeatures/

All new and modified APIs are listed in detail in the “HDF5 Software Changes from Release to Release” document:     https://www.hdfgroup.org/HDF5/doc/ADGuide/Changes.html

For detailed information regarding this release see the release notes:     https://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.0/src/hdf5-1.10.0-RELEASE.txt

For questions regarding these or other HDF issues, contact:      help@hdfgroup.org

Links to the HDF5 1.10.0 source code, documentation, and additional materials can be found on the HDF5 web page at:     https://www.hdfgroup.org/HDF5/

The HDF5 1.10.0 release can be obtained directly from:   https://www.hdfgroup.org/HDF5/release/obtain5110.html

User documentation for 1.10.0 can be accessed from:   https://www.hdfgroup.org/HDF5/doc/

The HDF Group’s HPC Program

Quincey Koziol, The HDF Group

“A supercomputer is a device for turning compute-bound problems into I/O-bound problems.” – Ken Batcher, Prof. Emeritus, Kent State University.

HDF5 grew out of a collaboration between the National Center for Supercomputing Applications (NCSA) and the US Department of Energy’s Advanced Simulation and Computing Program (ASC), so high-performance computing (HPC) I/O has been a focus from the very beginning. As we start our 20th year of development on HDF5, HPC I/O continues to be a critical driver of new features.

Los Alamos National Laboratory is home to two of the world’s most powerful supercomputers, each capable of performing more than 1,000 trillion operations per second. Here, ASC is examining the effects of a one-megaton nuclear energy source detonated on the surface of an asteroid. Image from ASC at http://www.lanl.gov/asci/

The HDF5 development team has focused on three things when serving the HPC community: performance, freedom of choice and ease of use.

Parallel I/O – Why, How, and Where to?

Mohamad Chaarawi, The HDF Group

First in a series: parallel HDF5

What costs applications a lot of time and resources that could otherwise go toward actual computation? Slow I/O. It is well known that I/O subsystems are very slow compared to other parts of a computing system. Applications use I/O to store simulation output for future use by analysis applications, to checkpoint application memory to guard against system failure, to exercise out-of-core techniques for data that does not fit in a processor’s memory, and so on. I/O middleware libraries such as HDF5 give application users a rich interface for organizing their data and storing it efficiently, and they invest a lot of effort in reducing, or completely hiding, the cost of I/O from applications.
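
As a small illustration of that “rich interface,” the sketch below shows what storing one step of simulation output might look like with the HDF5 C API: a group per timestep, a named and typed dataset, and an attribute carrying metadata. The file, group and dataset names are hypothetical, and error checking is omitted.

    /* Hypothetical sketch: organizing one timestep of simulation output
     * with the HDF5 C API (error checking omitted for brevity). */
    #include "hdf5.h"

    int main(void)
    {
        hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Group the output hierarchically, e.g. one group per timestep */
        hid_t grp = H5Gcreate2(file, "/timestep_0", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* Store a 2-D field as a named, typed dataset */
        static double data[100][100];
        hsize_t dims[2] = {100, 100};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(grp, "pressure", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* Attach self-describing metadata as an attribute */
        double t = 0.0;
        hid_t aspace = H5Screate(H5S_SCALAR);
        hid_t attr   = H5Acreate2(dset, "time", H5T_NATIVE_DOUBLE, aspace,
                                  H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

        H5Aclose(attr); H5Sclose(aspace); H5Dclose(dset); H5Sclose(space);
        H5Gclose(grp);  H5Fclose(file);
        return 0;
    }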

Parallel I/O is one technique used to access data on disk simultaneously from different application processes to maximize bandwidth and speed things up. There are several ways to do parallel I/O, and I will highlight the most popular methods that are in use today.  
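
As a preview, one of the most common approaches (and the one parallel HDF5 itself builds on) is MPI-IO: every process opens the same file and writes its own piece of a shared dataset, ideally in a single collective operation. The sketch below shows the general shape in C; it assumes an MPI-enabled HDF5 build, and the file name, dataset name and sizes are arbitrary.

    /* Sketch of collective parallel I/O through HDF5's MPI-IO driver
     * (assumes an MPI-enabled HDF5 build; names and sizes are arbitrary). */
    #include <mpi.h>
    #include "hdf5.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* All processes open one shared file through the MPI-IO driver */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Each rank owns a contiguous slab of one global 1-D dataset */
        hsize_t local = 1024, global = local * (hsize_t)nprocs;
        hsize_t offset = local * (hsize_t)rank;
        hid_t filespace = H5Screate_simple(1, &global, NULL);
        hid_t memspace  = H5Screate_simple(1, &local, NULL);
        hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &local, NULL);

        /* Collective write: all ranks participate in a single I/O operation */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        double buf[1024];
        for (hsize_t i = 0; i < local; i++) buf[i] = (double)(offset + i);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        H5Pclose(dxpl); H5Dclose(dset); H5Sclose(memspace); H5Sclose(filespace);
        H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }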

Blue Waters supercomputer at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. Blue Waters is supported by the National Science Foundation and the University of Illinois.

First, to leverage parallel I/O, it is very important that you have a parallel file system; …

HDF at the 2015 Oil & Gas High Performance Computing Workshop

Quincey Koziol, The HDF Group

Photo from NASA.gov

Perhaps the original producer of “big data,” the oil & gas (O&G) industry held its eighth annual High-Performance Computing (HPC) workshop in early March. Hosted by Rice University, the workshop brings in attendees from both the HPC and petroleum industries. Jan Odegard, the workshop organizer, invited me to give a tutorial and a short update on HDF5.

Rice University hosts the 2015 O&G HPC Workshop

The workshop (#oghpc) has grown a great deal over the last few years and now attracts more than 500 people, with preliminary attendance for this year’s workshop topping 575 (even in a “down” year for the industry). In fact, Jan is pushing to make it a “conference” next year, saying, “any workshop with more attendees than Congress is really a conference.” But it’s still a small enough crowd and venue that most people know each other well, on both the oil & gas and HPC sides.

The workshop program had two main tracks, one on HPC-oriented technologies that support the industry, and one on oil & gas technologies and how they can leverage HPC.  The HPC track is interesting, but mostly “practical” and not research-oriented, unlike, for example, the SC technical track. The oil & gas track seems more research-focused, in ways that can enable the industry to be more productive.

I gave an hour and a half tutorial on developing and tuning parallel HDF5 applications, which …