“…HDF5 is that rare product which excels in two fields: archiving and sharing data according to strict standardized conventions, and also ad-hoc, highly flexible and iterative use for local data analysis. For more information on using Python together with HDF5…”
An enormous amount of effort has gone into the HDF ecosystem over the past decade. Because of a concerted effort between The HDF Group, standards bodies, and analysis software vendors, HDF5 is one of the best technologies on the planet for sharing numerical data. Not only is the format itself platform-independent, but nearly every analysis platform in common use can read HDF5. This investment continues with tools like HDF Product Designer and the REST-based H5Serv project, for sharing data using the HDF5 object model over the Internet.
What I’d like to talk about today is something very different: the way that I and many others in the Python world use HDF5, not for widely-shared data but for data that may never even leave the local disk…
Fifteen years ago, NASA selected HDF as the format for the data products produced by NASA satellites for the NASA Earth Observing System (EOS).
The HDF Earth Science Program is well aware of this important legacy. We focus on continuing support of U.S. environmental satellite programs (NASA Earth Observing System and Joint Polar Satellite System, JPSS), ongoing quality assurance of the HDF libraries, and helping data users access and understand products written in HDF. The HDF-EOS Information Center (#hdfeos) includes code examples in MATLAB, IDL, NCL, and Python, many driven by user questions. The site also provides information on other HDF tools.
NASA’s decision ensured a role for HDF in Earth Science and set an important precedent. HDF developers, along with the U.S. and other Earth Observing nations, developed a clear distinction between Earth Science Data Objects (grids, swaths, profiles…); the metadata required to describe them; and the HDF objects (datasets, groups, attributes, etc.) that make them up.
The critical realization was that communities like EOS needed conventions for describing Earth Science objects to enable using and sharing those objects. These conventions, termed HDF-EOS, have been used successfully in hundreds of NASA products that can be easily shared among multiple users using standard tools.
Many other Earth Science communities have used the powerful combination of conventions and HDF.
Perhaps the original producers of “big data,” the oil & gas (O&G) industry held its eighth annual High-Performance Computing (HPC) workshop in early March. Hosted by Rice University, the workshop brings in attendees from both the HPC and petroleum industries. Jan Odegard, the workshop organizer, invited me to the workshop to give a tutorial and short update on HDF5.
The workshop (#oghpc) has grown a great deal during the last few years and now has more than 500 people attending, with preliminary attendance numbers for this year’s workshop over 575 people (even in a “down” year for the industry). In fact, Jan’s pushing it to a “conference” next year, saying, “any workshop with more attendees than Congress is really a conference.” But it’s still a small enough crowd and venue that most people know each other well, both on the Oil & Gas and HPC sides.
The workshop program had two main tracks, one on HPC-oriented technologies that support the industry, and one on oil & gas technologies and how they can leverage HPC. The HPC track is interesting, but mostly “practical” and not research-oriented, unlike, for example, the SC technical track. The oil & gas track seems more research-focused, in ways that can enable the industry to be more productive.
“I would like to do something with all the datasets in all the HDF5 files in this directory, but I’m not sure how to proceed.”
If this sounds all too familiar, then reading this article might be worth your while. The accepted general answer is to write a Python script (and use h5py), but I am not going to repeat here what you know already. Instead, I will show you how to hot-wire one of the new shiny engines, Apache Spark, and make a few suggestions on how to reduce the coding on your part while opening the door to new opportunities.
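For readers who have not written that script before, here is a minimal sketch of the "accepted general answer": find every HDF5 file under a directory and list each dataset with its shape and type. The function names (find_hdf5_files, list_datasets, summarize_directory) and the set of file suffixes are assumptions made for illustration, not anything prescribed by h5py or by this post.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever your files actually use.
HDF5_SUFFIXES = (".h5", ".hdf5", ".he5")


def find_hdf5_files(directory, suffixes=HDF5_SUFFIXES):
    """Return a sorted list of HDF5-looking files under `directory` (recursive)."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


def list_datasets(path):
    """Return (file, dataset name, shape, dtype) for every dataset in one file."""
    # Imported here so that file discovery above works even without h5py.
    import h5py

    found = []

    def visit(name, obj):
        # visititems calls this once for every group and dataset in the file;
        # returning None (implicitly) tells h5py to keep walking.
        if isinstance(obj, h5py.Dataset):
            found.append((str(path), name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return found


def summarize_directory(directory):
    """Yield one record per dataset across all HDF5 files in `directory`."""
    for file_path in find_hdf5_files(directory):
        yield from list_datasets(file_path)
```

Calling `for record in summarize_directory("."): print(record)` prints one tuple per dataset. Keeping the per-file work in a single function that takes a path and returns a plain list is a deliberate choice: a function of that shape is easy to hand to a parallel engine later instead of calling it in a loop.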
But what about Hadoop? There is no out-of-the-box interoperability between HDF5 and Hadoop. See our BigHDF FAQs for a few glimmers of hope. Major points of contention remain, such as HDFS’s “blocked” worldview and its aversion to relatively small objects, and then there is HDF5’s determination to keep its smarts away from prying eyes. Spark is more relaxed and works happily with HDFS, Amazon S3, and, yes, a local file system or NFS. More importantly, with its Resilient Distributed Datasets (RDD) it raises the level of abstraction and overcomes several Hadoop/MapReduce shortcomings when dealing with iterative methods. See reference for an in-depth discussion.
We are excited to introduce a blog series to share knowledge about HDF. The blog will include information about HDF technologies, uses of HDF, plans for HDF, our company and its mission, and anything else that might be of interest to HDF users and others who could enjoy the benefits of HDF.
Our staff will post regularly on the blog. We also welcome guest blogs from the community. If you’d like to do a post, please send an email to firstname.lastname@example.org.
We hope you will comment on blog posts and on the comments of others. Comments are moderated. We will review them and post them as quickly as possible.
The HDF blog does not replace our usual modes of communicating. We will continue to rely on the HDF website, the HDF forum, the HDF helpdesk, newsletters, bulletins, and Twitter.
Welcome, again, to the HDF Group Blog. Let this be the beginning of a lively and informative dialogue.
The HDF Group
We’d love to hear from you. What do you want us to write about? Let us know by commenting!
The first version of HDF was implemented the following spring. Over the next 10 years HDF enjoyed widespread interest and adoption for managing scientific and engineering data. The NASA Earth Observing System (EOS) was an early adopter of HDF. NASA provided much of the funding and technical requirements that made HDF a robust technology, able to support mission-critical applications.
By 1996 it became clear that HDF was not going to adequately address the demands of the next generation of data volumes and computing systems, and in 1998 a second version, called HDF5, was implemented. HDF5 was more scalable than the original HDF (now called HDF4), and had many other improvements. The Department of Energy’s Sandia, Los Alamos, and Lawrence Livermore National Laboratories provided the core funding, technical requirements, and many of the people that made the new format possible. HDF5 quickly replaced HDF4 in popularity, and spread even more rapidly.
In the late 1990s and early 2000s the HDF Group faced increasing demands to ensure that HDF was robust, that HDF5 kept up with advancing technologies and data demands, and that we offer high quality professional support for HDF users. It soon became clear that the HDF Group could best serve these demands by striking out on its own, as an entity separate from the University and NCSA, who had nurtured us so well for 18 years. In January 2005, The HDF Group was incorporated as a not-for-profit company. In July 2006, twelve of us set up shop in the University of Illinois Research Park, and we got ourselves a logo:
Our initial funding came from a financial company that had adopted HDF5 to help gather and manage multiple high speed, high volume market data feeds. We provided them with support and a number of new capabilities in HDF5. The NASA EOS soon joined with contracts for the new company, as did two of the three DOE Labs. The HDF Group chose to be a non-profit because we had a public mission, and we wanted to feel confident that the company would not be diverted from that mission for reasons of financial gain.
The HDF Group’s mission is:
To provide high quality software for managing large complex data, to provide outstanding services for users of these technologies, and to insure effective management of data throughout the data life cycle.
The mission has two goals:
1. To create, maintain, and evolve software and services that enable society to manage large complex data at every stage of the data life cycle.
2. To establish and maintain a sustainable organization with a highly-skilled and committed team devoted to accomplishing the first goal.
The rest is details. We’ll be getting into those details in future blog posts, and we’re hoping some of you will contribute.
Meanwhile, send your comments and questions. We’d love to hear from you. Subscribe to our blog posts on the sidebar. And if you’d like to do a post, please send an email to email@example.com.