HDFql (Hierarchical Data Format query language) was recently released to enable users to handle HDF5 files with a language as easy and powerful as SQL.
By providing a simpler, cleaner, and faster interface for HDF across C/C++/Java/Python/C#, HDFql aims to ease scientific computing, big data management, and real-time analytics. As the author of HDFql, Rick is collaborating with The HDF Group by integrating HDFql with tools such as HDF Compass, while continuously improving HDFql to feed user needs.
If you’re handling HDF files on a regular basis, chances are you’ve had your (un)fair share of programming headaches. Sure, you might have gotten used to the hassle, but navigating the current APIs probably feels a tad like filing expense reports: rarely a complete pleasure!
If you’re new to HDF, you might seek to avoid the format all together. Even trained users have been known to occasionally scout for alternatives. One doesn’t have to have a limited tolerance for unnecessary complexity to get queasy around these APIs – one simply needs a penchant for clean and simple data management.
This is what we heard from scientists and data veterans when asked about HDF. It’s what challenged our own synapses and inspired us to create HDFql. Because on the flip-side, we also heard something else:
HDF has proven immensely valuable in research and science
the data format pushes the boundaries on what is achievable with large and complex datasets
and it provides an edge on speed and fast access which is critical in the big data / advanced analytics arena
With an aspiration of becoming the de facto language for HDF, we hope that HDFql will play a vital role in the future of HDF data management by:
Enabling current users to arrive at (scientific) insights faster via cleaner data handling experiences
Inspiring prospective users to adopt the powerful data format HDF by removing current roadblocks
Perhaps even grabbing a few HDF challengers or dissenters along the way…
In the face of naysayers, the SQL abides. Read in our latest blog post how HDF5 and ODBC tie the room together.
When I order my HDF5 at Moe’s Data Diner, I usually ask for extra napkins. It’s a meal where you need both hands, and it can be messy before you get to the juicy bits. At least that’s the way it used to be. It’s easier to dissect with h5py , but what’s the hungry stranger to do, who is just coming through town and who is clueless? When asked the other night, I didn’t fall off my chair nor did I choke, but the question got my head going. What would it look like, that SQL on-the-side thing?
Say, I have an HDF5 dataset at /group1/A/dset2 and would like to select a few elements like so:
SELECT * FROM /group1/A/dset2 WHERE value > -999.0
Nice, but how do I get the result into my favorite analytics tool? Isn’t there some standard pipe or conduit that helps me over that last mile? Well, it’s kind of embarrassing to admit, but it’s been there since the early 1990s and is called Open Database Connectivity (ODBC) .
Think of ODBC as the “USB of data sources.” If you have a USB driver for a device, it’s game on. If you have an ODBC driver for your data source, then it’s SQL, milk, and honey from here. A growing number of applications come with some module or package for accessing data stored in HDF5 files, but you can be almost certain that your tool of choice has an ODBC client built in.
To build an HDF5/ODBC driver, we need a splash of the “secret sauce” shown in the figure above, and this is the subject of this blog post. Spoiler Alert: We are not giving away the recipe.
Anthony Scopatz, Assistant Professor at the University of South Carolina, HDF guest blogger
“Python is great and its ecosystem for scientific computing is world class. HDF5 is amazing and is rightly the gold standard for persistence for scientific data.Many people use HDF5 from Python, and this number is only growing due to pandas’ HDFStore. However, using HDF5 from Python has at least one more knot than it needs to. Let’s change that.”
Almost immediately when going to use HDF5 from Python you are faced with a choice between two fantastic packages with overlapping capabilities: h5py and PyTables. h5py wraps the HDF5 API more closely using autogenerated Cython. PyTables, while also wrapping HDF5, focuses more on a Table data structure and adds in sophisticated indexing and out-of-core querying. Which package you use depends on your use case – and sometimes you really need both!
At SciPy 2015, developers from PyTables, h5py, The HDF Group, pandas, as well as community members sat down and talked about what to do to make the story for Python and HDF5 more streamlined and more maintainable. Here is what we came up with: Continue reading →
The 2015 HDF workshop held during the ESIP Summer Meeting was a great success thanks to more than 40 participants throughout the four sessions. The workshop was an excellent opportunity for us to interact with HDF community members to better understand their needs and introduce them to new technologies. You can view the slide presentations from the workshop here.
From my perspective, the highlight of the workshop was the Vendors and Tools Session where we heard from Ellen Johnson (Mathworks), Christine White (Esri), Brian Tisdale (NASA), and Gerd Heber (The HDF Group) talk about new, and improved applications of HDF technologies. For example: Continue reading →
David Dotson, doctoral student, Center for Biological Physics, Arizona State University; HDF Guest Blogger
Recently I had the pleasure of meeting Anthony Scopatz for the first time at SciPy 2015, and we talked shop. I was interested in his opinions on MDSynthesis, a Python package our lab has designed to help manage the complexity of raw and derived data sets from molecular dynamics simulations, about which I was presenting a poster (click zip file to download).
In particular, I wanted his thoughts on how we are leveraging HDF5, and whether we could be doing it better. The discussion gave me plenty to think about going forward, but it also put me in contact with some of the other folks involved in the Python ecosystem surrounding HDF5. Long story short, I was asked to share how we were using HDF5 with a guest post on the HDF Group blog.
First a bit of background. At the Beckstein Lab we perform physics-based simulations of proteins, the molecular machines of life, in order to get at how they do what they do. These simulations may include thousands to millions of atoms, with the raw data a trajectory of their positions with time, which can have hundreds to millions of frames. Continue reading →
“Any software used in the computational sciences needs to excel in the area of high performance computing (HPC).”
The Computational Fluid Dynamics (CFD) General Notation System (CGNS) is an effort to standardize CFD input and output data, including grid (both structured and unstructured), flow solution, connectivity, boundary conditions, and auxiliary information. It provides a general, portable, and extensible standard for the storage and retrieval of CFD analysis data. The system consists of two parts: (1) a standard format for recording the data, and (2) software that reads, writes, and modifies data in that format. Continue reading →
Before the recent release of our PyHexad Excel add-in for HDF5, the title might have sounded like the slogan of a global coffee and baked goods chain. That was then. Today, it is an expression of hope for the spreadsheet users who run this country and who either felt neglected by the HDF5 community or who might suffer from a medical condition known as data-bulging workbook stress disorder. In this article, I would like to give you a quick overview of the novel PyHexad therapy and invite you to get involved (after consulting with your doctor).
To access the data in HDF5 files from Excel is a frontrunner among the all-time TOP 10 most frequently asked for features. A spreadsheet tool might be a convenient window into, and user interface for, certain data stored in HDF5 files. Such a tool could help overcome Excel storage and performance limitations, and allow data to be freely “shuttled” between worksheets and HDF5 data containers. PyHexad (,,,) is an attempt to further explore this concept. Continue reading →