Putting some Spark into HDF-EOS

Gerd Heber and Joe Lee, The HDF Group

In an earlier blog post [3], we merely floated the idea of bulk-processing HDF5 files with Apache Spark. In this article, we follow up with a few simple use cases and some numbers for a data collection to which many readers will be able to relate.

If the first question on your mind is, “What kind of resources will I need?”, then you have a valid point, but you also might be the victim of BigData propaganda. Consider this: “Most people don’t realize how much number crunching they can do on a single computer.”

HDF HDF-EOS: EOS Satellite Image courtesy of Jesse Allen, NASA Earth Observatory/SSAI
Aura: “A mission dedicated to the health of the earth’s atmosphere” using HDF technologies.  EOS Satellite Image courtesy of Jesse Allen, NASA Earth Observatory/SSAI

“If you don’t have big data problems, you don’t need MapReduce and Hadoop. It’s great to know they exist and to know what you could do if you had big-data problems.” ([5], p. 323)  In this article, we focus on how far we can push our personal computing devices with Spark, and leave the discussion of Big Iron and Big Data vs. big data vs. big data, etc. for another day.  Continue reading