Python & HDF5 – A Vision

Anthony Scopatz, Assistant Professor at the University of South Carolina, HDF guest blogger

“Python is great and its ecosystem for scientific computing is world class. HDF5 is amazing and is rightly the gold standard for persistence for scientific data.  Many people use HDF5 from Python, and this number is only growing due to pandas’ HDFStore.  However, using HDF5 from Python has at least one more knot than it needs to.  Let’s change that.”

Picture4Almost immediately when going to use HDF5 from Python you are faced with a choice between two fantastic packages with overlapping capabilities: h5py and PyTables.  h5py wraps the HDF5 API more closely using autogenerated Cython.  PyTables, while also wrapping HDF5, focuses more on a Table data structure and adds in sophisticated indexing and out-of-core querying. Which package you use depends on your use case – and sometimes you really need both!

At SciPy 2015, developers from PyTables, h5py, The HDF Group, pandas, as well as community members sat down and talked about what to do to make the story for Python and HDF5 more streamlined and more maintainable.  Here is what we came up with:  Continue reading

Answering biological questions using HDF5 and physics-based simulation data

David Dotson, doctoral student, Center for Biological Physics, Arizona State University; HDF Guest Blogger

Recently I had the pleasure of meeting Anthony Scopatz for the first time at SciPy 2015, and we talked shop. I was interested in his opinions on MDSynthesis, a Python package our lab has designed to help manage the complexity of raw and derived data sets from molecular dynamics simulations, about which I was presenting a poster (click zip file to download).

Figure 1: Molecular dynamics simulation: Example of a molecular dynamics simulation in a simple system: deposition of a single Cu atom on a Cu (001) surface. Each circle illustrates the position of a single atom; note that the actual atomic interactions used in current simulations are more complex than those of 2-dimensional hard spheres. Image: Kai Nordlund, professor of computational materials physics, University of Helsinki.

In particular, I wanted his thoughts on how we are leveraging HDF5, and whether we could be doing it better.  The discussion gave me plenty to think about going forward, but it also put me in contact with some of the other folks involved in the Python ecosystem surrounding HDF5. Long story short, I was asked to share how we were using HDF5 with a guest post on the HDF Group blog.

First a bit of background. At the Beckstein Lab we perform physics-based simulations of proteins, the molecular machines of life, in order to get at how they do what they do. These simulations may include thousands to millions of atoms, with the raw data a trajectory of their positions with time, which can have hundreds to millions of frames.
Continue reading

HDF at SciPy2015

John Readey, The HDF Group


Interestingly enough, in addition to being known as the place to go for BBQ and live music, Austin, Texas is a major hub of Python development.  Each year, Austin is host to the annual confab of Python developers known as the SciPy Conference.  Enthought, a local Python-based company, was the major sponsor of the conference and did a great job of organizing the event.  By the way, Enthought is active in Python-based training, and I thought the tutorial sessions I attended were very well done.  If you would like to get some expert training on various aspects of Python, check out their offerings.

As a first-time conference attendee, I found attending the talks and tutorials very informative and entertaining.  The conference’s focus is the set of packages that form the core of the SciPy ecosystem (SciPy, iPython, NumPy, Pandas, Matplotlib, and SymPy) and the ever-increasing number of specialized packages around this core.      Continue reading

HDF5 as a zero-configuration, ad-hoc scientific database for Python

Andrew Collette, Research Scientist with IMPACT, HDF Guest Blogger

“…HDF5 is that rare product which excels in two fields: archiving and sharing data according to strict standardized conventions, and also ad-hoc, highly flexible and iterative use for local data analysis. For more information on using Python together with HDF5…”

Accelerator Lab at IMPACT image used with permission
IMPACT breaks the 100 km/s speed barrier in February, 2015.  The Dust Accelerator Lab detected their fastest dust grain to date. An iron grain with a charge of 0.2 fC and diameter of 30 nm was clocked at a speed of 107.6 km/s (or 240,694 mph).  Image used with permission from IMPACT.

An enormous amount of effort has gone into the HDF ecosystem over the past decade. Because of a concerted effort between The HDF Group, standards bodies, and analysis software vendors, HDF5 is one of the best technologies on the planet for sharing numerical data. Not only is the format itself platform-independent, but nearly every analysis platform in common use can read HDF5. This investment continues with tools like HDF Product Designer and the REST-based H5Serv project, for sharing data using the HDF5 object model over the Internet.

What I’d like to talk about today is something very different: the way that I and many others in the Python world use HDF5, not for widely-shared data but for data that may never even leave the local disk…   Continue reading