Highly Scalable Data Service: Advancing energy innovation

The following is an excerpt from a National Renewable Energy Laboratory (NREL) press release.

NREL Releases Major Update to Wind Energy Dataset

May 8, 2018

A massive amount of wind data was recently made accessible online, greatly expanding the amount of information available on wind flow across the continental United States.

The data from the Energy Department’s National Renewable Energy Laboratory (NREL) enables anyone considering building a wind plant, or even erecting a single turbine, to understand how strong breezes tend to blow across a particular area and how energy from the wind can be integrated into the electrical grid.

[Photo: Wind turbines stretch to the horizon in Iowa. NREL is making available massive amounts of data that can help determine where to install wind turbines, such as these. (Photo by Dennis Schroeder / NREL)]

Originally released in 2015, the Wind Integration National Dataset—also known as the WIND Toolkit—made 2 terabytes (TB) of information available, covering about 120,000 locations identified using technical and economic considerations. The newly released subset holds 50 TB, or 10 percent of the entire database, covers 4,767,552 locations, and extends 50 nautical miles offshore. Small sections of Canada and Mexico are included as well.

“The entire dataset is 500 terabytes,” said Caleb Phillips, a data scientist at NREL. “This is far and above the largest dataset we work with here at NREL.”

The data was always available, just not in an easily usable form. To make the information readily accessible, NREL drew on its ongoing relationships with Amazon Web Services (AWS) and The HDF Group. Hosting the dataset on AWS removes previous limits on the amount of information that can be accessed readily online.

“What we’ve tried to do is make this really easy, so folks can play with the data and use it to better understand the potential for wind resources at a greater number of locations,” said Phillips. “They can download only the data they want.” An online visualization tool lets users explore the data interactively.

The HDF Group developed the Highly Scalable Data Service (HSDS) using the AWS cloud to provide users with easy access to the data, which is stored as a series of HDF5 files. The information can be narrowed to a specific site or time and analyzed using either a custom software solution or the Amazon Elastic Compute Cloud (Amazon EC2).
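For example, here is a minimal Python sketch of the kind of narrowed query the service supports, using The HDF Group's h5pyd client (introduced in the interview below). The endpoint, domain path, dataset name, and grid indices are illustrative assumptions rather than exact NREL values; check NREL's documentation for the current details.

```python
# Minimal sketch: read one location's wind-speed time series via HSDS.
# Assumes h5pyd is installed (pip install h5pyd). The endpoint, domain,
# dataset name, and indices below are placeholders; an NREL API key may
# also be required.
import h5pyd

with h5pyd.File("/nrel/wtk-us.h5", "r",
                endpoint="https://developer.nrel.gov/api/hsds") as f:
    wspd = f["windspeed_100m"]       # assumed 3-D dataset: (time, y, x)
    series = wspd[:, 850, 1250]      # every timestep at one grid cell
    print(series.shape, series.mean())
```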

“We are very excited to work with both NREL and AWS to make their large, technical data sets more accessible through our new scientific data platform, HDF Cloud,” said David Pearah, CEO of The HDF Group. “Our work aims to pave the way for large repositories of scientific data to be moved to the web without compromising query performance or resources.”

The WIND Toolkit provides barometric pressure, wind speed and direction, relative humidity, temperature, and air density data from 2007 to 2013. These seven years of data provide a detailed view of the U.S. wind resource and how it varies minute to minute, month to month, and year to year. These historical trends are essential for understanding the variability and quality of wind for power production. The simulated results were computed by 3Tier under contract for NREL using the Weather Research and Forecasting (WRF) model.

“Now that we have a data platform that supports release of large data sets, we hope to use this capability to release other big data as well that were previously considered too large to make publicly available,” Phillips said. Coming online next are solar irradiance data and wind data for Mexico, Canada, and potentially other countries. “We are thrilled to make these datasets available, allowing researchers to more easily find and use the data, as well as reducing costs for the national laboratory.”

While measurements across the rotor-swept area are the best way to determine wind conditions at a site, taking them is not always possible. The WIND Toolkit provides an estimate, which can be validated against on-site measurements as required.

The first release of data prompted regular calls from people in academia, industry, and government wanting additional information. The federal Bureau of Ocean Energy Management contracted with NREL to provide additional information for offshore areas. The WIND Toolkit Offshore Summary Dataset was made publicly available last year.

The original work to develop and release the WIND Toolkit was funded by the U.S. Department of Energy, Office of Energy Efficiency and Renewable Energy, Wind Energy Technologies Office.

NREL is the U.S. Department of Energy’s primary national laboratory for renewable energy and energy efficiency research and development. NREL is operated for the Energy Department by The Alliance for Sustainable Energy, LLC.

Learn more about The HDF Group's product behind NREL's public data release in the following interview with John Readey, the principal architect behind the new service.

What’s possible with the Highly Scalable Data Service?

The Highly Scalable Data Service (HSDS) is an open source project developed by The HDF Group. The idea behind HSDS is to enable “HDF as a Service”: all the functionality people expect from the HDF5 library, but accessible via a REST API rather than through in-process calls to the library.
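To make that contrast concrete, here is a hedged sketch of what a couple of those REST calls might look like from Python using the requests library; the endpoint URL and domain path are hypothetical.

```python
# Sketch of the REST flavor of HDF access: each operation is an HTTP call
# rather than an in-process library call. Endpoint and domain are illustrative.
import requests

endpoint = "http://hsds.example.com:5101"   # hypothetical HSDS instance
domain = "/shared/sample.h5"                # hypothetical HDF domain

# Fetch the domain's root group id.
root = requests.get(f"{endpoint}/", params={"domain": domain}).json()["root"]

# List the links (members) of the root group.
links = requests.get(f"{endpoint}/groups/{root}/links",
                     params={"domain": domain}).json()["links"]
print([link["title"] for link in links])
```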

Why is this important?

Having the HDF client access data indirectly (i.e., via a service) offers several advantages:

  • The server can run on a large machine (or even a cluster), providing more memory and network bandwidth to the storage system. HSDS utilizes all available cores on the server to optimize performance.
  • Clients that are “far away” from the data (say your laptop accessing data stored on AWS S3), can take advantage of an HSDS instance running “close” to the data (e.g. running in the same AWS data center as the S3 content). The bulk of the data transfer needed to query HDF data can happen between the server and S3, while only a relatively small amount of data needs to be transmitted to the client.
  • The server can mediate actions between multiple clients. This enables HSDS to support multiple writers/multiple readers (MWMR) without danger of the file being corrupted or data being overwritten.

In summary, HSDS provides additional capabilities and facilitates a shift to a cloud-native application model.

What do you mean by “cloud-native”?

Since the advent of the cloud, there’s been a shift from monolithic applications to what is known as a “micro-services” model. The latter approach realizes larger applications as a set of smaller components that communicate using HTTP. These components can be run in a cluster-management system such as Kubernetes. In this type of environment, HSDS becomes just another peer service that enables HDF data access regardless of where the server and client are running.

Setting up a service sounds complicated and time-consuming. Is that a problem?

We provide installation guides for AWS, Azure, and on-prem installs. Typically you should be up and running in an hour or so. For Azure users, it is even easier with the Azure Marketplace offering “HSDS Azure VM,” a pre-built image that gets users up and running in minutes.

Beyond HSDS, are there other related projects managed by The HDF Group?

Yes. In addition to HSDS, we created a Python client for the HSDS REST API, h5pyd. The package provides an API compatible with the popular h5py package, enabling existing Python applications to easily switch from accessing content in HDF5 files to accessing content provided by HSDS.
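Here is a minimal sketch of that switch; the dataset name `measurements` and both paths are hypothetical, but the slicing code is identical in either mode because h5pyd mirrors the h5py API.

```python
# Toggle between a local HDF5 file and an HSDS domain with a one-line change.
USE_HSDS = True

if USE_HSDS:
    import h5pyd as h5
    path = "/home/myuser/data.h5"    # hypothetical HSDS domain path
else:
    import h5py as h5
    path = "data.h5"                 # local HDF5 file

with h5.File(path, "r") as f:
    dset = f["measurements"]         # hypothetical dataset name
    print(dset.shape, dset.dtype)
    print(dset[0:10])                # slicing works identically in both
```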

In addition, there’s a plugin for the HDF5 library, the HDF REST VOL, that enables C/C++ applications to similarly switch between local files and the service without changes to the application code.

Finally, there’s HDF Lab, a JupyterLab-based environment where users can explore HDF through provided datasets, tutorials, and example programs. HDF Lab users are given an HSDS account they can use to access petabytes of data provided through Amazon’s public dataset program.

NREL is using HSDS to provide access to its release of 50 TB of climate data. That’s big, but not that big. What do you think is possible for HSDS?

50 TB may not seem that big, but what many people may not realize is that the capacity of computer storage has increased much faster than the ability to transmit data over the internet. For example, if you tried to download 50 TB over a typical broadband connection, it would take you the better part of a year. So the ability for NREL’s users to grab just the data they need almost immediately is quite an advantage. And the amount of data available has only increased: since the blog article was published, NREL has made another 200 TB of data accessible via HSDS. The reality of “data inertia” (it’s incredibly hard to move these large data collections around) means it’s much more feasible to move the code to the data than the data to the code; downloading data files to your laptop is just not practical anymore.
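A quick back-of-the-envelope check on that download-time claim; the broadband rate here is an assumption, so adjust it to your own connection.

```python
# How long does 50 TB take over a modest broadband link?
size_bytes = 50e12                    # 50 TB
rate_bps = 20e6                       # assume 20 megabits per second
seconds = size_bytes * 8 / rate_bps
print(f"{seconds / 86400:.0f} days")  # about 231 days at this rate
```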

It can handle big files, but what about speed?

So that’s the thing that makes HSDS special. Internally, HDF5 datasets are organized as “chunks”: equal-sized tiles that subdivide the dataset space. Traditionally with the HDF5 library, when you select a slice of data from a dataset, each chunk that is accessed has to be processed sequentially. With HSDS, these operations are handled in parallel, greatly speeding up read and write operations.
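A small illustration of the idea, with made-up dataset and chunk shapes: a single-point time series touches one chunk per chunk-length of the time axis, and each of those chunk reads is independent of the others, so they can proceed concurrently.

```python
# Which chunks does the selection [:, 850, 1250] touch?
# Shapes below are invented for illustration only.
dataset_shape = (61368, 1602, 2976)   # (time, y, x)
chunk_shape = (2048, 128, 128)

t_starts = range(0, dataset_shape[0], chunk_shape[0])
touched = [(t // chunk_shape[0], 850 // chunk_shape[1], 1250 // chunk_shape[2])
           for t in t_starts]
print(f"{len(touched)} chunks could be read in parallel")  # 30 here
```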

Further, if you need to look at data in that same chunk again, it’s likely to have been cached by the server. It’s there for you to access without pulling it across the network again.

You’re getting your data faster, and if you’re in an environment where you pay for that transfer, such as cloud egress charges, you’re saving money.

How does performance compare with accessing HDF5 files using the HDF5 library?

This depends greatly on the application, how the data file is set up, and the computer hardware. For many use cases, HDF5 library access will be much faster since no inter-process communication is required. In other cases HSDS will be faster, for example when the advantage of its parallel processing of chunks outweighs the out-of-process overhead. In the end, the best thing to do is to try HSDS with your data and application and see what kind of performance you get.

A common challenge with client-server applications is managing the capacity of the server with the client load. How does HSDS deal with this?

If your application has an expected peak load, I’d recommend testing HSDS on machines with varying amounts of memory and numbers of cores to find a configuration that handles it. If the workload starts to exceed what HSDS can handle, the server will reject some requests with a “503 Service Unavailable” error. The h5pyd client will see these errors and retry the request after a short timeout, but such responses are a sign your service is underscaled.
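At the HTTP level, that retry behavior looks roughly like the following sketch (h5pyd implements this for you; the endpoint and domain here are illustrative).

```python
# Back off and retry when HSDS sheds load with a 503.
import time
import requests

def get_with_retry(url, params, retries=5, backoff=1.0):
    for attempt in range(retries):
        resp = requests.get(url, params=params)
        if resp.status_code != 503:            # anything but "Service Unavailable"
            resp.raise_for_status()
            return resp.json()
        time.sleep(backoff * (2 ** attempt))   # exponential backoff
    raise RuntimeError("server still overloaded after retries")

info = get_with_retry("http://hsds.example.com:5101/",
                      {"domain": "/shared/sample.h5"})
print(info["root"])
```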

If a single machine doesn’t meet your needs, you can run HSDS on a Kubernetes cluster (AKS for Azure and EKS for AWS are supported as well as self-managed clusters). Beyond enabling higher workloads, HSDS on Kubernetes enables the administrator to easily scale capacity by adjusting the number of nodes (Kubernetes pods) as needs change.

Finally, HSDS for AWS Lambda provides a completely serverless option. Rather than a server running 24/7, Lambda functions are invoked for each HTTP request. There’s no server to manage, and Lambda supports up to 1,000 simultaneous invocations. This approach is especially good for highly dynamic workloads, though there is higher latency since it takes about two seconds for a Lambda function to “spin up.”

You mention HSDS being used in an on-premises installation or on a commercial cloud where there would be egress charges. Where can people use HSDS?

HSDS can be used on-premises, on AWS, or on Azure (Google Cloud support is coming). Supported storage includes POSIX disk, AWS S3, and Azure Blob Storage. Given that many HDF users are dealing with large data collections and heavy compute jobs, it can be quite expensive to move their software to the cloud. What seems common among many HDF users is that they are thinking about the cloud and want to be ready for it, but will not be moving right away. This is where the “cloud-native” approach I mentioned earlier seems especially relevant. It’s quite feasible to develop cloud-native applications on on-prem infrastructure, say a Kubernetes cluster along with an S3-API-compatible storage system. This approach provides the flexibility to stay on-prem, move to the cloud, or adopt a hybrid approach as circumstances warrant.

How can The HDF Group help users who have questions on how to best use HSDS or have general questions about HDF?

Everyone is encouraged to ask questions on the HDF forum or send an email to help@hdfgroup.org. The HDF Group also offers consulting services for users who need more specialized help. In our consulting experience, one of the biggest ways we can improve the use of HDF5 is to be there at the beginning, helping with the setup. For organizations that want to expedite that process and make sure the decisions made at the start give them the fastest and most efficient system, but for whatever reason cannot use the Azure Marketplace install, The HDF Group also offers consulting for setup and beyond.

What does this product mean for The HDF Group?

The Highly Scalable Data Service and the consulting offerings are an extension of what The HDF Group has always done and are part of its mission: keeping HDF-stored data accessible as that data moves into cloud storage or elsewhere. The HSDS Azure VM is a quick install that a researcher or a small group can grab for fast access to their data. And HDF Lab is a convenient exploration and collaboration tool that can deliver big benefits to users for a minimal cost after the free trial. We’re also looking at how we might offer HDF Lab free to certain populations, like students, or, with sponsorship, make it free for everyone.

If you’re ready to start a conversation with a sales engineer about how HSDS might work for you, please contact us.