HDFql – the new HDF tool that speaks SQL

Rick, HDFql team, HDF guest blogger

HDFql (Hierarchical Data Format query language) was recently released to enable users to handle HDF5 files with a language as easy and powerful as SQL. 

By providing a simpler, cleaner, and faster interface for HDF across C/C++/Java/Python/C#, HDFql aims to ease scientific computing, big data management, and real-time analytics. As the author of HDFql, Rick is collaborating with The HDF Group by integrating HDFql with tools such as HDF Compass, while continuously improving HDFql to meet user needs.

Introducing HDFql

If you’re handling HDF files on a regular basis, chances are you’ve had your (un)fair share of programming headaches. Sure, you might have gotten used to the hassle, but navigating the current APIs probably feels a tad like filing expense reports: rarely a complete pleasure!

If you’re new to HDF, you might seek to avoid the format altogether. Even trained users have been known to occasionally scout for alternatives. One doesn’t have to have a limited tolerance for unnecessary complexity to get queasy around these APIs – one simply needs a penchant for clean and simple data management.

This is what we heard from scientists and data veterans when we asked them about HDF. It’s what challenged our own synapses and inspired us to create HDFql. Because on the flip side, we also heard something else:

  • HDF has proven immensely valuable in research and science
  • The data format pushes the boundaries of what is achievable with large and complex datasets
  • It provides an edge in speed and fast access, which is critical in the big data / advanced analytics arena

With an aspiration of becoming the de facto language for HDF, we hope that HDFql will play a vital role in the future of HDF data management by:

  • Enabling current users to arrive at (scientific) insights faster via cleaner data handling experiences
  • Inspiring prospective users to adopt the powerful data format HDF by removing current roadblocks
  • Perhaps even grabbing a few HDF challengers or dissenters along the way…

Python & HDF5 – A Vision

Anthony Scopatz, Assistant Professor at the University of South Carolina, HDF guest blogger

“Python is great and its ecosystem for scientific computing is world class. HDF5 is amazing and is rightly the gold standard for persistence for scientific data.  Many people use HDF5 from Python, and this number is only growing due to pandas’ HDFStore.  However, using HDF5 from Python has at least one more knot than it needs to.  Let’s change that.”

Almost as soon as you start using HDF5 from Python, you are faced with a choice between two fantastic packages with overlapping capabilities: h5py and PyTables. h5py wraps the HDF5 API more closely using autogenerated Cython. PyTables, while also wrapping HDF5, focuses more on a Table data structure and adds in sophisticated indexing and out-of-core querying. Which package you use depends on your use case – and sometimes you really need both!

At SciPy 2015, developers from PyTables, h5py, The HDF Group, and pandas, as well as community members, sat down and talked about how to make the story for Python and HDF5 more streamlined and more maintainable. Here is what we came up with: …

Answering biological questions using HDF5 and physics-based simulation data

David Dotson, doctoral student, Center for Biological Physics, Arizona State University; HDF Guest Blogger

Recently I had the pleasure of meeting Anthony Scopatz for the first time at SciPy 2015, and we talked shop. I was interested in his opinions on MDSynthesis, a Python package our lab has designed to help manage the complexity of raw and derived data sets from molecular dynamics simulations, about which I was presenting a poster (click zip file to download).

Figure 1: Molecular dynamics simulation: Example of a molecular dynamics simulation in a simple system: deposition of a single Cu atom on a Cu (001) surface. Each circle illustrates the position of a single atom; note that the actual atomic interactions used in current simulations are more complex than those of 2-dimensional hard spheres. https://en.wikipedia.org/wiki/Molecular_dynamics Image: Kai Nordlund, professor of computational materials physics, University of Helsinki.

In particular, I wanted his thoughts on how we are leveraging HDF5, and whether we could be doing it better.  The discussion gave me plenty to think about going forward, but it also put me in contact with some of the other folks involved in the Python ecosystem surrounding HDF5. Long story short, I was asked to share how we were using HDF5 with a guest post on the HDF Group blog.

First, a bit of background. At the Beckstein Lab we perform physics-based simulations of proteins, the molecular machines of life, in order to get at how they do what they do. These simulations may include thousands to millions of atoms, and the raw data are a trajectory of the atoms’ positions over time, which can have hundreds to millions of frames.
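
To get a feel for those numbers, here is a rough back-of-the-envelope sketch of how large a raw trajectory might be. It assumes (this is not stated in the post) that each atom stores three coordinates as 4-byte floats and that nothing is compressed; the system sizes are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope size of a raw trajectory.
# Assumption (not from the post): each atom stores x, y, z as 4-byte floats,
# with no compression and no per-frame metadata.

BYTES_PER_COORD = 4   # assumed single-precision storage
COORDS_PER_ATOM = 3   # x, y, z


def trajectory_bytes(n_atoms: int, n_frames: int) -> int:
    """Return the uncompressed size in bytes of the positions for all frames."""
    return n_atoms * n_frames * COORDS_PER_ATOM * BYTES_PER_COORD


if __name__ == "__main__":
    # Hypothetical system sizes spanning the ranges mentioned in the text.
    examples = [(1_000, 100), (100_000, 10_000), (1_000_000, 1_000_000)]
    for n_atoms, n_frames in examples:
        size_gb = trajectory_bytes(n_atoms, n_frames) / 1e9
        print(f"{n_atoms:>9,} atoms x {n_frames:>9,} frames = {size_gb:,.4f} GB")
```

Under these assumptions, 100,000 atoms over 10,000 frames comes to about 12 GB, and a million atoms over a million frames to about 12 TB, which is why managing both raw and derived data sets quickly becomes a problem in its own right.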

Get your Bearings with HDF Compass

John Readey, The HDF Group

We’ve recently announced a new viewer application for HDF5 files: HDF Compass. In this blog post we’ll explore the motivations for providing this tool, review its features, and speculate a bit about future direction for Compass.

HDF Compass is a desktop viewer application for HDF5 and other file formats. A free and open source software product, it runs on Mac OS X, Windows, and Linux.

America Runs on Excel and HDF5*

* With Python’s Help

Gerd Heber, The HDF Group

Before the recent release of our PyHexad Excel add-in for HDF5 [1], the title might have sounded like the slogan of a global coffee and baked goods chain. That was then. Today, it is an expression of hope for the spreadsheet users who run this country and who either felt neglected by the HDF5 community or who might suffer from a medical condition known as data-bulging workbook stress disorder. In this article, I would like to give you a quick overview of the novel PyHexad therapy and invite you to get involved (after consulting with your doctor).

Accessing the data in HDF5 files from Excel is a frontrunner among the all-time top 10 most frequently requested features. A spreadsheet tool might be a convenient window into, and user interface for, certain data stored in HDF5 files. Such a tool could help overcome Excel storage and performance limitations, and allow data to be freely “shuttled” between worksheets and HDF5 data containers. PyHexad ([4],[5],[6],[7]) is an attempt to further explore this concept.

HDF5 for the Web – HDF Server

John Readey, The HDF Group

HDF5 is a great way to store large data collections, but size can pose its own challenges.  As a thought experiment, imagine this scenario:

You write an application that creates the ultimate Monte Carlo simulation of the Monopoly game. The application plays through thousands of simulated games for a hundred different strategies and saves its results to an HDF5 file. Given that we want to capture all the data from each simulation, let’s suppose the resultant HDF5 file is over a gigabyte in size.

Naturally, you’d like to share these results with all your Monopoly-playing, statistically-minded friends, but herein lies the problem: How can you make this data accessible? Your file is too large to put on Dropbox, and even if you did use an online storage provider, interested parties would need to download the entire file when perhaps they are only interested in the results for “Strategy #89: Buy just Park Place and Boardwalk.” If we could store the data in one place, but enable access to it over the web using all the typical HDF5 operations (listing links, getting type information, dataset slices, etc.), that would be the answer to our conundrum.
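
As a rough sketch of what such web access could look like from the client side, the snippet below builds request URLs for the three operations just listed. The server address, the path layout, the `select` query parameter, and the identifiers are all illustrative assumptions, not the documented HDF Server API.

```python
from urllib.parse import urlencode

# Hypothetical server address; not a real endpoint.
BASE_URL = "http://hdf-server.example.org"


def links_url(group_id: str) -> str:
    """URL that would list the links inside a group (assumed path layout)."""
    return f"{BASE_URL}/groups/{group_id}/links"


def type_url(dataset_id: str) -> str:
    """URL that would return a dataset's type information (assumed path layout)."""
    return f"{BASE_URL}/datasets/{dataset_id}/type"


def slice_url(dataset_id: str, start: int, stop: int) -> str:
    """URL that would fetch rows [start, stop) of a dataset.

    The `select` parameter name and its bracket syntax are assumptions.
    """
    query = urlencode({"select": f"[{start}:{stop}]"})
    return f"{BASE_URL}/datasets/{dataset_id}/value?{query}"


if __name__ == "__main__":
    # "strategy-89" and "root" are made-up identifiers for illustration.
    print(links_url("root"))
    print(type_url("strategy-89"))
    print(slice_url("strategy-89", 0, 100))
```

The point of the sketch is that a friend interested only in one strategy would request a single slice by URL instead of downloading the whole gigabyte-sized file.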

HDF5 as a zero-configuration, ad-hoc scientific database for Python

Andrew Collette, Research Scientist with IMPACT, HDF Guest Blogger

“…HDF5 is that rare product which excels in two fields: archiving and sharing data according to strict standardized conventions, and also ad-hoc, highly flexible and iterative use for local data analysis. For more information on using Python together with HDF5…”

IMPACT breaks the 100 km/s speed barrier in February, 2015.  The Dust Accelerator Lab detected their fastest dust grain to date. An iron grain with a charge of 0.2 fC and diameter of 30 nm was clocked at a speed of 107.6 km/s (or 240,694 mph).  Image used with permission from IMPACT.

An enormous amount of effort has gone into the HDF ecosystem over the past decade. Because of a concerted effort between The HDF Group, standards bodies, and analysis software vendors, HDF5 is one of the best technologies on the planet for sharing numerical data. Not only is the format itself platform-independent, but nearly every analysis platform in common use can read HDF5. This investment continues with tools like HDF Product Designer and the REST-based H5Serv project, for sharing data using the HDF5 object model over the Internet.

What I’d like to talk about today is something very different: the way that I and many others in the Python world use HDF5, not for widely-shared data but for data that may never even leave the local disk…