Tests of the NCSA h4toh5 utility with NASA Datasets

Robert E. McGrath
mcgrath@ncsa.uiuc.edu
NCSA

August, 2001


Abstract

This paper reports tests of the h4toh5 utility on real NASA data.  The utility converted nearly all of the 1776 test files, usually in less than one second per file.  These results show that converting HDF4 files to HDF5 is feasible, both for large collections and on demand.

1. Introduction

Since HDF5 is not backward compatible with earlier versions of HDF, many users must transition from HDF4 to HDF5.  This transition may require rewriting software and possibly rewriting datasets. The details depend on the goals and situation of the users.

NCSA has published a default mapping of HDF4 objects to HDF5 [1].  This mapping provides guidance and recommendations for how HDF4 files and objects should be represented in HDF5.  Of course, users may wish to do something other than the default, in order to take best advantage of HDF5.

The h4toh5 utility is provided as part of the HDF5 1.4.2 release [2].  This tool converts one HDF4 file to an equivalent HDF5 file, using the HDF4 to HDF5 mapping [1].  It is important to realize that this is a default conversion, which may not preserve some of the 'semantics' of the HDF4 data, particularly if the file is complicated.  If the default conversion is not adequate for some purpose, it may serve as an example or prototype from which to construct a customized conversion.

This experiment tested the h4toh5 utility using NASA HDF4 datasets as input. The goal was to test the utility with a set of real files, to confirm that it works, and to assess its performance.

All of the input data was real NASA science data that is provided to the public in HDF4, which makes this a realistic test.

Some of the NASA datasets were created with the HDF-EOS library (using HDF4).  It is important to realize that the h4toh5 converter does not 'understand' HDF-EOS.  In this case, the native HDF4 components of Grids, Swaths, etc., were converted into HDF5 objects.  The result is definitely not a legal HDF-EOS5 file.  The HDF-EOS5 library stores the HDF-EOS objects in ways that are optimized for HDF5, which are not the default translation of how they were stored in HDF4.

Therefore, the conversions performed on these datasets should be seen as a demonstration that conversion is feasible, although custom conversions may well be needed to create the desired HDF5 file.


2. Method

2.1. Description of Data

A sample of NASA datasets was acquired from DAACs and DIAL. All were obtained from public sources of sample or real data.  These datasets included data from approximately 33* data products, from 8 instruments (including AVHRR, ACRIM, CERES, MISR, MODIS, MOPITT, and SSMI).  The total number of granules (HDF4 files) was 1776.  The datasets are listed in the Appendix.

This sample of data was chosen arbitrarily from what could be obtained from DIAL [7] and from public FTP services at DAACs. Therefore, it is not statistically representative of NASA data or of any specific body of data.  However, it is all real or sample data from NASA.

2.2. Experimental Environment

The experiment was run on a dual 550 MHz Pentium III running Linux 2.2.18smp, using a local disk. The h4toh5 utility is from the HDF5 1.4.2 release (August 2001) and uses HDF4.2r3. File sizes are those reported by the Linux file system, and times were collected with the system 'time' function.

Each file was converted at least 5 times, with the average time reported.
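
The repeated-timing procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script used in the experiment: the helper names average_run_time and h4toh5_command are hypothetical, the command line is assumed to take the form 'h4toh5 input output', and the sketch measures wall-clock time with time.perf_counter rather than using the system 'time' function.

```python
import subprocess
import sys
import time


def average_run_time(command, repeats=5):
    """Run `command` `repeats` times and return the mean wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


def h4toh5_command(hdf4_path, hdf5_path):
    """Build the converter command line (assumed form: h4toh5 input output)."""
    return ["h4toh5", hdf4_path, hdf5_path]
```

With the converter installed, a call such as average_run_time(h4toh5_command("granule.hdf", "granule.h5")) would return the mean of five runs; the file names here are placeholders.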

The output files were visually inspected with the H5View utility [3] and compared to the original HDF4 file (using the JHV tool [4]).


3. Results

All but four files were successfully converted by the h4toh5 utility. The resulting HDF5 files contained all the data from the HDF4 files.  The conversions were extremely fast; in every case the conversion took less time than the download from the original data server.

3.1. Failures

Two files could not be converted. These files exposed bugs in the converter, which will be fixed in the next release. They are included in the data reported here.

Two files were damaged in transfer and could not be used. They are excluded from the data reported here.

3.2. Converted File Size

With one exception, the HDF5 files were all within a few percent of the size of the corresponding HDF4 file, with the HDF5 file usually slightly smaller.  The exception was one group of browse files that contain compressed images.  These images are stored uncompressed in the HDF5 file, resulting in a file about 3 times larger.  In the future, the h4toh5 conversion will apply compression when it is used in the HDF4 file, and the compressed HDF5 file should then be approximately the same size as the HDF4 file in all cases.  Of course, individual cases will vary.
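
A size comparison of this kind can be sketched as follows. The function names and the 10% tolerance are assumptions made for illustration; the report itself only states that sizes agreed to "within a few percent" and were taken from the file system.

```python
import os


def size_ratio(hdf4_size, hdf5_size):
    """Return the HDF5 size divided by the HDF4 size (both in bytes)."""
    if hdf4_size <= 0:
        raise ValueError("HDF4 size must be positive")
    return hdf5_size / hdf4_size


def file_size_ratio(hdf4_path, hdf5_path):
    """Same ratio, using the sizes reported by the file system."""
    return size_ratio(os.path.getsize(hdf4_path), os.path.getsize(hdf5_path))


def is_outlier(ratio, tolerance=0.10):
    """True when the sizes differ by more than `tolerance` (assumed: 10%)."""
    return abs(ratio - 1.0) > tolerance
```

Under these assumptions, a browse file whose HDF5 copy is 3 times larger gives a ratio of 3.0 and is flagged, while a ratio such as 0.97 is not.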

3.3. Conversion Times

Of the 1776 files tested, 98% were converted in 1 second or less.  The longest conversion time observed was 332 seconds (5 minutes 32 seconds), for a 117 MB HDF4 file. Only 4 files averaged longer than 1 minute per conversion (117 MB, 781 MB, 75 MB, and 68 MB, respectively).  Table 1 shows these statistics.  Figure 1 shows a histogram of the 20 longest conversion times.

Table 1.  File Conversion Time**
Measure | Result
Highest single time of 1776 datasets | 332 s (MOP02-19970814-L2V0.1.1.hdf)
Average time > 1 min | 4 out of 1776 datasets (0.2%)
Average time > 1 sec | 19 datasets (1.1%)
Average time < 1 sec | 98.6% of datasets
(**Based on 1776 granules.  2 granules could not be used because of errors in downloading.)
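
The percentages in Table 1 follow from simple counts: 4 of 1776 is about 0.2%, and 19 of 1776 is about 1.1%. A minimal sketch of how such a summary could be computed from a list of per-file average times is given below. The function name summarize and the dictionary keys are hypothetical, and the sketch assumes that the "> 1 sec" count also includes the files that took longer than 1 minute, which the report does not state.

```python
def summarize(avg_times):
    """Summarize per-file average conversion times, given in seconds."""
    total = len(avg_times)
    if total == 0:
        raise ValueError("no timings supplied")
    over_min = sum(1 for t in avg_times if t > 60.0)
    # Assumption: files over 1 minute are also counted as over 1 second.
    over_sec = sum(1 for t in avg_times if t > 1.0)
    under_sec = sum(1 for t in avg_times if t < 1.0)

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "files": total,
        "longest_avg_s": max(avg_times),
        "over_1_min": over_min,
        "pct_over_1_min": pct(over_min),
        "over_1_s": over_sec,
        "pct_over_1_s": pct(over_sec),
        "under_1_s": under_sec,
        "pct_under_1_s": pct(under_sec),
    }
```

Note that this sketch reports the longest average time, whereas Table 1 reports the highest single time observed.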


Figure 1. Longest 20 Conversion times.

In general, larger files take more time to convert, but the relationship is not simply linear in the size of the file.  The sample did not contain enough variability to analyze the relationship further.  Generally, any file under 4 MB was converted in much less than 1 s.  Figure 2 shows a scatter plot for the 1776 files.


Figure 2.  Summary of conversion times by the size of the original HDF4 file.


4. Conclusions

These tests show that the h4toh5 utility works reliably on a variety of real data, including data written by older versions of HDF4. The small number of failures has been traced to a couple of bugs in the converter, which will be fixed in the next release.

The output files appear to contain all the data, and are readily recognized as faithful (if simpleminded) translations of the HDF4 original.  The files have similar size to the HDF4 originals, as should be the case.

In the past, there has been considerable concern about the performance of the conversion utility and similar custom programs.  This test shows that performance is not a problem, even on a fairly inexpensive PC. Of course, conversions would be much slower if a network disk were used, or on a slower system.

The speed of the conversion is, of course, related to the size of the files.  However, the slowest conversion was not the largest file.  It is possible that the speed of conversion is limited by the size of the largest object, or the number of objects, or some combination of factors.  This cannot be determined from this limited sample of data. Also, the next release of the h4toh5 utility should have even better performance, especially for large objects.

This study shows that converting HDF4 files into HDF5 is feasible, even for files in the range of 800 MB. With conversion times of a few seconds to a few minutes, it is clear that whole archives could be converted in a few hours. Alternatively, most data could be converted "on the fly" when requested from a server.
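
The archive estimate is simple arithmetic, sketched below. The function name estimate_archive_hours and its parameters are hypothetical, and the sketch assumes that conversions run back to back (or are spread evenly across independent workers) with a known mean time per file; this study did not measure such a mean.

```python
def estimate_archive_hours(file_count, mean_seconds_per_file, workers=1):
    """Estimate hours needed to convert an archive of `file_count` files."""
    if file_count < 0 or mean_seconds_per_file < 0:
        raise ValueError("counts and times must be non-negative")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Total seconds of work, divided evenly among workers, in hours.
    return file_count * mean_seconds_per_file / workers / 3600.0
```

For example, a hypothetical archive of 10,000 files at an assumed mean of 2 seconds per file would take about 5.6 hours on one machine.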

As discussed above, the h4toh5 utility would not be appropriate to convert HDF-EOS files.  The heconvert tool [5] will provide a similar conversion for HDF-EOS files.  The heconvert tool could not be tested for this study, because it is not available for Linux.

For some data products, default conversions may not be sufficient.  A custom converter would be needed, perhaps using functions from libh4toh5*** to convert individual objects. This study suggests that, once created, a custom converter should work reliably and efficiently.

Overall, this study suggests that converting files from HDF4 to HDF5 is technically viable.


5. Notes

*In some cases I'm not certain what officially counts as a 'data product'.  There are 33 different 'kinds' of HDF4 file, with many instances (e.g., multiple days or months) in some cases.

***The NCSA libh4toh5 is a library of C functions to perform a default conversion of individual HDF4 objects. This library will be available in Fall 2001.


6. Acknowledgments

This report is based upon work supported in part by a Cooperative Agreement with NASA under NASA grant NAG 5-2040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration.

Other support was provided by NCSA and other sponsors and agencies [6].


7. References

1. Mike Folk, Robert E. McGrath, and Kent Yang, "Mapping HDF4 Objects to HDF5 Objects", H4toH5Mapping.pdf.

2. h4toh5 utility

3. Java HDF5

4. Java HDF

5. HDF-EOS and Related Software

6. Acknowledgments

7. DIAL


Appendix: Data Sources


The data was obtained from publicly available NASA datasets.
 
Data From DIAL | Observation Date | Description
avhrr8kmmonthly | 1993 | AVHRR 8KM 10 Day Composites - Southeast Asia
avhrr1km10day (transfer failed--not converted) | 1993 | AVHRR
avhrr8km10day (transfer failed--not converted) | 1993 | AVHRR
tahoe-north-middle | 1998 | ASTER L1BT, Lake Tahoe
ASTL1B_000830185 | 1998? | ASTER L1 test?
CER_ES8_Terra-FM2_Test_SCF_016011.20000830.subset_70_20_-140_-40.20001012_204110Z | 1998? |
CER_ES4_Terra-FM1_Beta_015013.200004 | 2000 | Preliminary data (do not use)
CER_FSW_TRMM-PFM-VIRS_Sample_000000.199801Z06 | 2000 | Preliminary data (do not use)
98034001632_GOES08_IMAGER | 1998? | GOES Imager?
MISR_AM1_AS_LANDSFC_P027_O000027_01.dw | 1996 | Prelaunch Land surface
MISR_AM1_AS_AEROSOL_P027_O000027_01_dw | 1996 | Prelaunch aerosols over ocean
MOD021KM.A2000080.1815.002.2000083151033 | 2000 | Hudson's Bay
MOD02HKM.A2000242.0140.002.2000247230108 | 1999 | MODIS Preliminary
NISE_SSMIF11_19911227 | 1999 | Ice and Snow
ballon_sp | 1996-1997 | HDFEOS Point data for balloon launches
misr_l1a_ccd_df.new.nominal | 2001 | MISR L1A
MOP02-19970814-L2V0.1.1 | 1997 | Sample MOPITT L2

 
Data From DAACs | Source | Observation Date | Description
MOAPWBM1.P1.ADD2000321.002.2001034035708 | Goddard | 2000 | MODIS L4 ocean data
MOAPWBM2.P2.ADD2000321.002.2001034035718 | Goddard | 2000 | MODIS Ocean Level 4 data
MOAPWBME.PAR.ADD2000321.002.2001034035728 | Goddard | 2000 | MODIS L4 ocean data
MOD03.A2000106.1540.001.2000109075312 | Goddard | 2000 | MODIS radiometric geolocation
MOD03.A2000110.0220.002.2000193195357 | Goddard | 2000 | MODIS radiometric geolocation
MOD08_E3.A2000337.002.2001037044240 | Goddard | 2000 | Sample MODIS L3
C1986151201607.L2_BRS | PODAAC | 1986 | Coastal Zone Color (hourly--all data for May, 827 granules)
C1986151201607.L2_GAC | PODAAC | 1986 | Coastal Zone Color (hourly--all data for May, 827 granules)
f13_Tb_01220_01D | MSFC | 2001 | SSMI Brightness temperature (All 29 passes, 1 day)
f13_hn_01220_01D | MSFC | 2001 | SSMI geolocation (All 29 passes, 1 day)
f13_ln_01220_01D | MSFC | 2001 | SSMI geolocation (All 29 passes, 1 day)

Other MODIS data
MOD04_L2.A2000242.0140.002.2000264223516
MOD04_L2.A2000243.1850.002.2000252164712
MOD05_L2.A2000243.1850.002.2000252164414
MOD06_L2.A2000243.1850.002.2000252173103
MOD35_L2.A2000243.1850.002.2000244222700