Blog

America Runs on Excel and HDF5*

* With Python’s Help

Gerd Heber, The HDF Group

Before the recent release of our PyHexad Excel add-in for HDF5[1], the title might have sounded like the slogan of a global coffee and baked goods chain. That was then. Today, it is an expression of hope for the spreadsheet users who run this country and who either felt neglected by the HDF5 community or who might suffer from a medical condition known as data-bulging workbook stress disorder. In this article, I would like to give you a quick overview of the novel PyHexad therapy and invite you to get involved (after consulting with your doctor).

To access the data in HDF5 files from Excel is a frontrunner among the all-time TOP 10 most frequently asked for features. A spreadsheet tool might be a convenient window into, and user interface for, certain data stored in HDF5 files. Such a tool could help overcome Excel storage and performance limitations, and allow data to be freely “shuttled” between worksheets and HDF5 data containers. PyHexad ([4],[5],[6],[7]) is an attempt to further explore this concept.  

It helps to look at related concepts, such as MATLAB[9] and PyTables[10], but perfect requirements are hard to come by, and we have to console ourselves with a few guiding principles for this exploration[8]:

1. The problem statement is rather vague and not necessarily the best, or even the right one.
2. For a prototype, choose relatively independent elements with low external complexity and (potentially) high internal complexity.
3. Keep it simple.
4. Build in and maintain options; there is no need for making decisions early.

Also, since this is an exploration, cabin comfort is not the top priority and there is hardly a best spot on the ship for the seasick. Wait for the forthcoming holidaymaker special on PyHexad Cruises!

At the moment, PyHexad appears in Excel as a set of about a dozen user-defined functions and includes endpoints for displaying file contents, reading and writing arrays, tables, and attributes, as well as displaying images stored in HDF5 datasets. (See Figure 1.) Readers familiar with MATLAB’s high-level functions[9] for HDF5 should feel very much at home.

Category Functions
Files and Objects h5showtree, h5showList, h5getInfo, h5newFile, h5newGroup
Arrays h5readArray, h5writeArray, h5newArray
Tables h5readTable, h5writeTable, h5appendRows, h5newTable
Images h5readImage
Attributes h5readAttribute, h5writeAttribute

Figure 1.

Before discussing some of the functions, let’s briefly revisit the problem and maybe eliminate a misconception or two.

Excel is not another HDF5 Viewer

Most of the word “information” contains the word “inform,” so I call things information only if they inform me, not if they are just collections of data, of stuff.  (R.S. Wurmann, 1996)

Don’t get me wrong. We need good HDF5 viewers and editors and more. Excel might make an ok HDF5 viewer or editor, but that’s not where it shines. (How could you run a country on something that’s a mediocre HDF5 viewer anyway?) A well done Excel workbook delivers analysis-based, decision-oriented information, it informs, it tells a story, it doesn’t just show you stuff. Contrast that with some HDF5 viewers out there (see Figure 2).

Figure 2. Screen captures of Excel, HDFView, ViTables and HDF Compass.

Maybe the principal requirement for PyHexad can be summed up like this:

PyHexad shall facilitate the delivery of decision-oriented information in Excel that is based on the analysis of data stored in HDF5 files.

That can still mean a lot of different things, but it makes it clear that the accessing or viewing of data stored in HDF5 files, per se, misses the point (for Excel).

PyHexad Primitives

The three core PyHexad primitives are Array, Table, and Image. They are spelled using small caps as a reminder that they might not be exactly what you think they are. Except for Image they don’t even have a direct counterpart in HDF5. An overview of how PyHexad primitives are represented in Excel and in HDF5 is shown in Figure 3.

Figure 3.

Figure 3.

PyHexad supports the reading and writing of Arrays and Tables, as well as the creation of new Arrays and Tables, whereas Images are currently supported only for reading. Arrays and Tables can be read or written partially, i.e., only subsets of elements (hyperslabs) can be targeted and, for Tables, selections or updates can be restricted to a subset of columns (compound fields). All functions support the Excel-typical notations for array literals ({720,1440}), referencing other cells ($A1) and cell ranges ($A1:$F12345).

HDF5 attributes and groups do not have direct counterparts in Excel, but can be examined and manipulated using the functions mentioned in Figure 2.

Finally, there are three functions to retrieve information about the content of an HDF5 container. In Figure 4, the outputs of the h5showList and h5showTree functions (for the same file) are shown. The former provides a list view of objects grouped by their “parent group,” whereas the latter mimics a hierarchical view via indentation, a poor man’s tree view.

Figure 4.

To get a better sense of the current PyHexad functionality in action, I invite you to watch the YouTube movies[7], take a look at the User’s Guide[5], and, if you’re brave enough, try it yourself.

In summary, our preliminary result, is the “layer cake” shown in Figure 5.

Figure 5. PyHEXAD layers

Its main ingredients are Python’s strength as a glue language and the relegation of the main sources of complexity, HDF5’s C-API and the Excel COM API, to the h5py[2] module and the PyXLL[3] package (distributed by Enthought, Inc.), respectively.

As in so many cases, there are other options. See for example[11], for an interesting approach based on Visual Studio Tools for Office and ILNumerics.

Food for Thought

What you see in PyHexad is not equivalent to all we thought about, but some of maybe the (we hope) better ideas, which might help us to enter into a conversation with potential users.

Here are a few questions for you:

1. The current version of PyHexad is by and large stateless in the sense that, other than symbolically (file names, path names), no reference between a Workbook and underlying HDF5 files is maintained. Depending on one’s perspective this might be a benefit or an issue. What do you think?

2. There’s plenty of room for UI embellishments, file dialogs, tree-views, etc. How much do they help to improve the user-experience and where do they start to get in the way?

3. Maybe this is a little gimmicky, but someone might want to create a single (archival) package, which would include the Excel Workbook front-end and an HDF5 data store. The Office Open XML SpreadsheetML file format (.xlsx) is actually a ZIP package, and one might contemplate somehow sneaking the HDF5 file into that package. Since the “heavy-weight” data will usually be stored in HDF5, this may not be such a good idea: unlike compression in HDF5, which works at the dataset level, the entire ZIP package needs to be (de-)compressed. Embedding the Workbook into the HDF5 file’s userblock might be a more practical approach.

4. The same way an IPython Notebook[16] can read data from a Web service endpoint, such as OPeNDAP[15], wouldn’t it be nice if h5[read,write]Array could read/write data directly from an HDF server[14] endpoint?

5. PyTables[10] is probably the most successful community-supported representation of tabular structures and time series in HDF5. What do you think about representing Tables as PyTables? That way we could also “lift” its query-ability to the Excel level…

Finally, does PyHexad facilitate the delivery of decision-oriented information in Excel that is based on the analysis of data stored in HDF5 files? We have not intentionally done anything to subvert this goal, but the honest answer is, we don’t know.

You tell us! (Have you noticed the “Leave a reply…” below, or under the title?)

References
[1] PyHexad announcement – https://support.hdfgroup.org/news/announcements/AnnouncingPyHexad.pdf

[2] HDF5 for Python – http://www.h5py.org/

[3] Python for Excel with PyXLL – https://www.enthought.com/products/pyxll/

[4] PyHexad on GitHub – https://github.com/HDFGroup/PyHexad.git

[5] PyHexad User’s Guide – http://pyhexad.readthedocs.org/en/latest/index.html

[6] PyHexad on PyPI – https://pypi.python.org/pypi/pyhexad

[7] PyHexad on YouTube – https://www.youtube.com/watch?v=jRcUz9iGZJo&list=PLPyhR4PdEeGaqUt-tGdFLMLUw0na8LD6h

[8] Mark W. Maier and Eberhardt Rechtin, The Art of Systems Architecting, Third Edition, CRC Press 2009.

[9] http://www.mathworks.com/help/matlab/high-level-functions.html

[10] PyTables – http://www.pytables.org/

[11] ILNumerics Blog, Fun with HDF5, ILNumerics and Excel, 29 November 2014.

[12] HDF5 Image and Palette Specification Version 1.2, September 2005.

[13] HDF5 Table Specification Version 1, May 2002.

[14] h5serv – REST-based service for HDF5 data.

[15] OPeNDAP – http://www.opendap.org/

[16] IPython – http://ipython.org/

No Comments

Leave a Comment