...we focus on how far we can push our personal computing devices with Spark. It consists of 7,850 HDF-EOS5 files covering 27 years and totals about 120 GB. We use a driver script, which reads a dataset of interest from each file in the collection, computes per-file quantities of interest, and gathers them in a CSV file for visualization. The processing time on our reference tablet machine for 3.5 years of data using 4 logical processors was about 10 seconds....

Gerd Heber, The HDF Group

Editor’s Note: Since this post was written in 2015, The HDF Group has developed HDF5 Connector for Apache Spark™, a new product that addresses the challenges of adapting large-scale array-based computing to the cloud and object storage while intelligently handling the full data management life cycle. If this interests you, we’d love to hear from you.

“I would like to do something with all the datasets in all the HDF5 files in this directory, but I’m not sure how to proceed.”

If this sounds all too familiar, then reading this article might be worth your while. The accepted general answer is to write a Python script (and use h5py [1]), but I am not going to repeat here what you know already. Instead, I will show you how to hot-wire one of the new shiny engines, Apache Spark [2], and make a few suggestions on how to reduce the coding on your part while opening the door to new opportunities.

But what about Hadoop? There is no out-of-the-box interoperability between HDF5 and Hadoop. See our BigHDF FAQs [3] for a few glimmers of hope. Major points of contention remain, such as HDFS’s “blocked” worldview and its aversion to relatively small objects, and HDF5’s determination to keep its smarts away from prying eyes. Spark is more relaxed and works happily with HDFS, Amazon S3, and, yes, a local file system or NFS. More importantly, with its Resilient Distributed Datasets (RDDs) [4], it raises the level of abstraction and overcomes several Hadoop/MapReduce shortcomings when dealing with iterative methods. See reference [5] for an in-depth discussion.
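
To make the RDD idea a little more concrete, here is a minimal sketch of the general pattern: put the list of file paths into an RDD, let Spark apply a per-file function in parallel, and collect the results into a CSV file. It is an illustration under assumptions, not the driver script discussed in this post. The function names (summarize, read_and_summarize, run), the summary statistics, and the CSV columns are all hypothetical. The sketch also assumes that pyspark and h5py are installed, that every worker can see the files under the same paths (a local file system or NFS), and that each dataset is non-empty and fits in memory.

```python
import csv
import glob


def summarize(path, values):
    """Reduce one file's flattened dataset values to a single CSV-ready row.

    Built-in reductions keep the sketch dependency-free; NumPy reductions
    would be the faster choice for large arrays. Assumes `values` is non-empty.
    """
    values = [float(v) for v in values]
    return (path, len(values), sum(values) / len(values), min(values), max(values))


def read_and_summarize(path, dataset):
    """Runs inside a Spark task: open one HDF5 file, read one dataset, summarize it."""
    import h5py  # imported inside the function so it is resolved where the task runs

    with h5py.File(path, "r") as f:
        data = f[dataset][...]  # read the whole dataset into a NumPy array
    return summarize(path, data.ravel())


def run(pattern, dataset, out_csv):
    """Driver: one RDD element per file, one CSV row per file."""
    from pyspark import SparkContext

    sc = SparkContext(appName="hdf5-summary")  # hypothetical application name
    try:
        paths = sorted(glob.glob(pattern))
        # Spark schedules the per-file work across the available cores.
        rows = sc.parallelize(paths).map(lambda p: read_and_summarize(p, dataset)).collect()
    finally:
        sc.stop()

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "count", "mean", "min", "max"])
        writer.writerows(rows)
```

A call might look like run("data/*.h5", "/path/to/dataset", "summary.csv"), where the glob pattern, the dataset path, and the output file name are placeholders. Only the file paths travel through the RDD. Each task opens its own file, so no HDF5 handle ever has to be serialized.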

Figure 1. A simple HDF5/Spark scenario

As our model problem (see Figure 1), consider the following scenario: