Putting some Spark into HDF-EOS

Gerd Heber and Joe Lee, The HDF Group

In an earlier blog post [3], we merely floated the idea of bulk-processing HDF5 files with Apache Spark. In this article, we follow up with a few simple use cases and some numbers for a data collection to which many readers will be able to relate.

If the first question on your mind is, “What kind of resources will I need?”, then you have a valid point, but you might also be a victim of Big Data propaganda. Consider this: “Most people don’t realize how much number crunching they can do on a single computer.”

Aura: “A mission dedicated to the health of the earth’s atmosphere” using HDF technologies. EOS Satellite Image courtesy of Jesse Allen, NASA Earth Observatory/SSAI

“If you don’t have big data problems, you don’t need MapReduce and Hadoop. It’s great to know they exist and to know what you could do if you had big-data problems.” ([5], p. 323) In this article, we focus on how far we can push our personal computing devices with Spark, and leave the discussion of Big Iron and Big Data vs. big data, etc. for another day.

From HDF5 Datasets to Apache Spark RDDs

Gerd Heber, The HDF Group

“I would like to do something with all the datasets in all the HDF5 files in this directory, but I’m not sure how to proceed.”

If this sounds all too familiar, then reading this article might be worth your while. The accepted general answer is to write a Python script (and use h5py [1]), but I am not going to repeat here what you already know. Instead, I will show you how to hot-wire one of the shiny new engines, Apache Spark [2], and make a few suggestions on how to reduce the coding on your part while opening the door to new opportunities.
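For the record, that familiar single-machine baseline might look something like the following minimal sketch. The directory name "data" is a placeholder, and visit_datasets is an illustrative helper, not anything prescribed by the article.

    import os

    import h5py

    def visit_datasets(path):
        """Print the name, shape, and type of every dataset in one HDF5 file."""
        with h5py.File(path, "r") as f:
            def report(name, obj):
                # visititems() walks the whole file; keep only datasets.
                if isinstance(obj, h5py.Dataset):
                    print("{}:{} shape={} dtype={}".format(path, name, obj.shape, obj.dtype))
            f.visititems(report)

    # "data" is a placeholder; point it at your own directory of .h5 files.
    for entry in sorted(os.listdir("data")):
        if entry.endswith(".h5"):
            visit_datasets(os.path.join("data", entry))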

But what about Hadoop? There is no out-of-the-box interoperability between HDF5 and Hadoop; see our BigHDF FAQs [3] for a few glimmers of hope. Major points of contention remain, such as HDFS’s “blocked” worldview and its aversion to relatively small objects, and then there is HDF5’s determination to keep its smarts away from prying eyes. Spark is more relaxed and works happily with HDFS, Amazon S3, and, yes, a local file system or NFS. More importantly, with its Resilient Distributed Datasets (RDDs) [4] it raises the level of abstraction and overcomes several Hadoop/MapReduce shortcomings when dealing with iterative methods. See reference [5] for an in-depth discussion.
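To make the pattern concrete, here is a minimal PySpark sketch: build an RDD from a list of file names and map an h5py-based reader over it, computing a per-dataset mean as a stand-in for real work. The glob pattern, the application name, and the dataset_means helper are illustrative assumptions, not part of the article, and the sketch assumes h5py and the files themselves are visible on every worker (local file system or NFS, as noted above).

    import glob

    import h5py
    import numpy as np
    from pyspark import SparkContext

    def dataset_means(path):
        """Open one HDF5 file and return a (file, dataset, mean) triple
        for each numeric dataset it contains."""
        triples = []
        with h5py.File(path, "r") as f:
            def collect(name, obj):
                if isinstance(obj, h5py.Dataset) and obj.dtype.kind in "fiu":
                    triples.append((path, name, float(np.mean(obj[...]))))
            f.visititems(collect)
        return triples

    sc = SparkContext(appName="hdf5-rdd-sketch")

    # Parallelize the *names* of the files, not their contents; each Spark
    # task then opens its share of the files locally.
    paths = glob.glob("data/*.h5")          # placeholder pattern
    means = sc.parallelize(paths).flatMap(dataset_means).collect()
    for file_name, dset_name, mean in means:
        print(file_name, dset_name, mean)

Distributing file names rather than bytes sidesteps HDFS’s “blocked” worldview entirely: HDF5 keeps its smarts, and Spark merely schedules who reads what.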

Figure 1. A simple HDF5/Spark scenario

As our model problem, consider the scenario sketched in Figure 1.