2024 HDF5 User Group Meeting Agenda - The HDF Group - ensuring long-term access and usability of HDF data and supporting users of HDF technologies

Agenda

2024 HDF5 User Group Meeting (HUG24)
August 5-7, 2024
McCormick Tribune Campus Center at the Illinois Institute of Technology, 3201 S. State, Chicago, IL.

You can find full information for our conference on the conference website. Slide Decks and video recordings are be linked below. You can also watch all of the videos as a Youtube playlist. This year, slide decks will also be added to Zenodo when final. Check out our Zenodo community.

Monday, August 5, 2024

8:00-9:00 AM – BREAKFAST

9:00-9:10 AM – Welcome Address – Suren Byna, The Ohio State University
Slide deck | Video

9:10-9:35 AM – Hermes: A Heterogeneous-Aware Multi-Tiered Distributed I/O Buffering System – Luke Logan, Research Software Engineer at Gnosis Research Center
Slide Deck | Video

Modern High-Performance Computing (HPC) systems are adding extra layers to the memory and storage hierarchy, named deep memory and storage hierarchy (DMSH), to increase I/O performance. New hardware technologies, such as NVMe and SSD, have been introduced in burst buffer installations to reduce the pressure for external storage and boost the burstiness of modern I/O systems. DMSH has demonstrated its strength and potential in practice. However, each layer of DMSH is an independent heterogeneous system and data movement among more layers is significantly more complex even without considering heterogeneity. How to efficiently utilize the DMSH is a subject of research facing the HPC community. In this paper, we present the design and implementation of Hermes: a new, heterogeneous-aware, multi-tiered, dynamic, and distributed I/O buffering system. Hermes enables, manages, supervises, and, in some sense, extends I/O buffering to fully integrate into the DMSH. We introduce three novel data placement policies to efficiently utilize all layers and we present three novel techniques to perform memory, metadata, and communication management in hierarchical buffering systems. Our evaluation shows that, in addition to automatic data movement through the hierarchy, Hermes can significantly accelerate I/O and outperforms by more than 2x state-of-the-art buffering platforms.

9:35-10:00 AM – Distributed Affix-Based Metadata Search in Self-Describing Data Files – Wei Zhang, Ph.D, Lawrence Berkeley National Laboratory
Slide Deck | Video

As the volume of scientific data continues to grow, the need for efficient metadata search mechanisms becomes increasingly critical. Self-describing data formats like HDF5 are central to managing and storing this data, yet traditional metadata search methods often fall short, especially for affix-based queries such as prefix, suffix, and infix searches. In this talk, I will review the advancements made with DART (Distributed Adaptive Radix Tree) and IDIOMS (Index-powered Distributed Object-centric Metadata Search), two powerful solutions designed to address the challenges of distributed affix-based metadata searches in high-performance computing environments.

DART introduces a scalable, trie-based indexing approach that significantly improves search performance and load balancing across distributed systems. Building on this foundation, IDIOMS further optimizes metadata searches by integrating a distributed in-memory trie-based index and supporting both independent and collective query modes. Together, these systems demonstrate substantial performance improvements over traditional methods, making them highly effective for managing large-scale scientific data.

By leveraging the principles and methodologies behind DART and IDIOMS, we can envision a robust solution for affix-based metadata searches in distributed HDF5 environments. This talk will provide a comprehensive overview of the existing work, highlight key technical innovations, and discuss the potential for extending these techniques to support efficient, scalable metadata searches in self-describing data formats.

10:00-10:25 AM – Semantic Search and Natural Language Query over HDF5 – Chenxu Niu, Texas Tech University
Slide Deck | Video

The ability to effectively query HDF5 files is a prerequisite for fully leveraging their potential. Over the years, a series of lexical matching solutions have been proposed to address the metadata search problem of HDF5 files. However, these traditional lexical matching approaches often ignore the semantic relationship between the query and the actual metadata/data in the datasets. With such systems, users need a deep understanding of the format and structure of their data as well as their true intentions when finding data of interest. Therefore, it is necessary to provide a metadata querying mechanism that captures the semantic meaning of every query, bridging the gap between the true intentions of user queries and the actual data of interest.

Toward the goal of providing an advanced search service for HDF5 files, our research has progressed through several key stages to address different challenges. Initially, with kv2vec and PSQS (Parallel Semantic Querying Service), we moved beyond lexical matching to semantic search, focusing on capturing the semantics of keywords. Our method captures keywords from metadata attributes, enables the semantization of metadata, and performs semantic searches over scientific datasets. As our work evolved, we recognized the necessity of handling complete sentence inputs rather than solely keywords. This shift from keyword-based searches to full-sentence queries underscores the increasing complexity and capability of our methods. By leveraging large language models (LLMs), our new approach can process natural language queries and return the desired results on scientific files, significantly enhancing the efficiency of scientific data discovery and elevating scientific data management to a new level. This advancement holds substantial potential for revolutionizing scientific data discovery within the HDF5 community and beyond.

10:25-10:50 AM – Drishti VOL: The performance profiling and tracing HDF5 VOL connector – Jean Luca Bez, Lawrence Berkeley National Laboratory
Slide Deck | Video

Drishti is an interactive I/O analysis framework that seeks to close the gap between trace collection, analysis, and tuning by detecting common root causes of I/O performance inefficiencies and providing actionable user recommendations. In this talk, we demonstrate how Drishti can combine different sources of metrics and traces to provide a deeper understanding of I/O problems. Considering HDF5-based applications, we proposed the Drishti passthrough VOL connector to trace relevant high-level HDF5 calls that can be easily combined with other sources of I/O metrics, such as Darshan traces, to provide a cross-layer analysis and enhance the insights. We discuss our motivation and design choices and demonstrate how this HDF5 connector can aid in pinpointing the root causes of I/O performance bottlenecks.

10:50-11:00 AM – BREAK

11:00-12:00 PM – KEYNOTE: AuroraGPT: A Large-Scale Foundation Model for Advancing Science – Rajeev Thakur, Argonne National Laboratory
Slide Deck | Video

AuroraGPT is a new initiative at Argonne National Laboratory aimed at the development and understanding of foundation models, such as large language models, for advancing science. The goal of AuroraGPT is to build the infrastructure and expertise necessary to train, evaluate, and deploy foundation models at scale for scientific research, using DOE’s leadership computing resources. This talk will give an overview of AuroraGPT, efforts and accomplishments so far, and plans for the future.

Rajeev Thakur is an Argonne Distinguished Fellow and Deputy Director of the Data Science and Learning Division at Argonne National Laboratory. He received a Ph.D. in Computer Engineering from Syracuse University. His research interests are in high-performance computing, parallel programming models, runtime systems, communication libraries, scalable parallel I/O, and artificial intelligence and machine learning. He is a co-author of the MPICH implementation of MPI and the ROMIO implementation of MPI-IO. He was the director of Software Technology for DOE’s Exascale Computing Project from 2016 to 2017 and led the Programming Models and Runtimes area in ECP from 2016 to the end of the project in 2024. He is a Fellow of IEEE.

12:00-1:00 PM – LUNCH

1:00-1:25 PM – Optimizing molecular dynamics AI model using HDF5 and DYAD – Dr. Hariharan Devarajan, Lawrence Livermore National Laboratory
Slide Deck | Video

The Massively Parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI) is a framework designed to execute multiscale modeling simulations of large molecular systems integrated through ML techniques. The DL training portion of the MuMMI workflow uses NPZ arrays that lead to inefficient data loading characteristics such as I/O inefficiency, sample distribution, and sample coverage. In this talk, we will discuss our initial experience with integrating HDF5 and DYAD to improve the data access behavior of the training and improve sample distribution and sample coverage for the DL training. This talk will focus on three main directions. First, the challenges and experience of moving workloads towards HDF5. Second, features that would assist in adapting HDF5 for optimizing DL training. Finally, our experience in integrating the DYAD solution for optimizing I/O with HDF5 for this DL workload. In conclusion, we will demonstrate that we accelerated the MuMMI workflow using DYAD and HDF5 on the Corona cluster at LLNL.

1:25-1:50 PM– Analytical Data Platform: Divide & Conquer the Multi-Dimensional Gordian Knot – Donpaul C. Stephens, AirMettle, Inc.
Video

Today’s advanced scientific and medical equipment gathers a torrential amount of data in NetCDF4 and HDF5 data formats. While HDF5’s flexibility is invaluable in many individual applications, the high-volume applications typically leverage a more constrained subset of its capabilities. The high-volume use cases need to turn the information into actionable insights as the data directly impacts health, safety, quality, and reliability. In the quest to obtain this speed, some analysts have been considering alternative formats like Zarr for analysis – and potentially more of the data pipeline. Zarr explicitly exposes a determined amount of parallelism to the storage infrastructure (object/file system), which explicitly needs to maintain each of these internal components for external use.

AirMettle has taken a different approach; rather than exposing the complexity to the user to achieve high performance, we have focused on enabling users to get exceptional analytical performance from the NetCDF4 and HDF5 data they have today. We do not require them to transform their data while providing them with “server-less” APIs that enable them to request common data operations directly. This achieves many of the objectives of Zarr while providing additional functionality. Our solution efficiently stores data while extracting as much as 100x parallelism from large objects, allowing a much faster response time. This is achieved through not just multi-threading, but massively parallel distributed processing which efficiently only sends necessary results over the network. Our goal is human-scale interactive response, which appears attainable at any practical scale deployment.

In this talk, we will present the results of our work integrating multi-dimensional data analytics into our Analytical Data Platform. We maintained a consistent “Divide & Conquer” approach for enabling scalable analytics on user-friendly NetCDF4 and HDF5 as we previously did for record-oriented data types, including CSV, JSON, and Parquet. Our NetCDF4 work is currently in field trials, supporting sub-selection and “re-gridding” for re-scaling data. Our HDF5 support is in alpha, enabling UDF models to process 2D data frames for various applications, including labeling and feature extraction, which can enable “AI search” for the needle in massive data sets for the scientific community.

Join us to explore how the NetCDF4 and HDF5 data formats you have grown to love, can deliver the performance required to unlock a new generation of scientific discovery.

Our NetCDF4 work has been developed in collaboration with the University of Alabama in Huntsville, sponsored by NOAA. Our HDF5 work has been developed in collaboration with the University of Chicago, sponsored by the Department of Energy.

1:50-2:15 PM – Rebasing For The Win! – Jay Lofstead, Sandia National Laboratories
Slide Deck | Video

HDF5 has incorporated many performance improvements over the years. Each new release adds new functionality and potentially addresses user pain points. However, any feature that required a change in the file format that would make forward compatibility from early library versions is turned off by default. This continued forward compatibility leads many users to seek alternatives not understanding that the advances HDF has published in the literature are not enabled by default. This talk will argue that a clean break with the past will greatly improve HDF5’s reputation while not affecting the current user community.

2:15-2:40 PM – Upcoming new HDF5 feature: our progress on HDF5 multi-threading and more – John Mainzer and Elena Pourmal – Lifeboat, LLC
Slide Deck | Video

Over the past three years Lifeboat, LLC received DOE SBIR grants to enhance HDF5 and support multi-threaded VOL connectors, sparse and variable-length data storage, and encryption of HDF5 files. In our talk we will report on the progress we have made on the new features and will share our experience of working with the existing HDF5 code while creating non-trivial extensions to the software.

2:40-2:55 PM – BREAK

2:55-3:20 PM – New datatypes in HDF5 – Jordan Henderson, The HDF Group
Slide Deck | Video

This presentation will discuss new datatypes that are, or will be, supported in HDF5, including the “_Float16” 16-bit floating point datatype and complex number datatypes. There will also be brief discussion on future support for other datatypes that may be useful.

3:20-3:55 PM – HDF5: State of the Union – Dana Robinson, The HDF Group
Slide Deck | Video

In this talk, the HDF Group Director of Engineering will discuss the status of the HDF Group products, community engagement, and future directions.

3:55-4:55 PM – Community Discussion, M. Scot Breitenfeld, The HDF Group
Video

The HDF Group will talk about some recent changes to improve communication and engagement. Please add your topics for discussion to the session notes in the google doc.

Tuesday, August 6, 2024

8:00-9:00 AM – BREAKFAST

9:00-10:00 AM – KEYNOTE: Uncharted Territory – Exploring New Frontiers for HDF5 – Quincey Koziol, NVIDIA
Slide Deck | Video

Blast-off: GPU Accelerated HDF5
The industry’s shift to accelerated computing has moved the goal posts for accessing application data, and provided dramatically faster compute resources for HDF5 transforms. This talk describes a path to directly accessing GPU memory for data access operations and moving HDF5 data transforms from the host to the GPU.

We’re Breaking Up! Disaggregated HDF5 Containers on Object Storage Systems
Object storage systems dominate cloud computing and are making significant inroads into the on-premise storage deployments as well. This talk describes storing HDF5 containers (files, today) natively on object storage systems in high-performance and portable way. Disaggregating HDF5 metadata from application data and storing it in a lightweight database, e.g. SQLite, provides improvements to performance and resiliency, and adds query capabilities that are expensive with today’s native file format. Sharding application data into multiple objects enables greater parallelism and performance as well, especially with cloud object stores.

Not for Spotify: Streaming HDF5 Data
HDF5 is ubiquitous across science domains, but has typically targeted array-like access patterns. Streaming data from cameras, particle accelerators, and other high-speed data sources into HDF5 files is also common. This talk presents new HDF5 APIs and format extensions that are tailored to this use case and maximize performance for both writing and reading data in a streamed fashion.

10:00-10:15 AM – BREAK

10:15-10:40 AM – Enlarging Effective DRAM Capacity through Hermes – Luke Logan, Research Software Engineer and PhD student at Gnosis Research Center
Slide Deck | Video

Traditionally, memory and I/O substrates have been considered separate entities due to their differences in terms of performance and persistence. However, modern data-intensive memory-centric workloads widespread in HPC are challenging these distinctions. Data analytics, machine learning, and deep learning codes perform large-scale computations on data which greatly exceed the bounds of memory, relying on explicit data movements to I/O systems to meet basic capacity requirements. This often leads to significantly increased development complexity and suboptimal, one-off solutions where I/O and compute happen in distinct, synchronous phases, incurring the memory wall problem in the compute phase and the notorious I/O bottleneck during the I/O phase. Conversely, scientific simulation codes are becoming increasingly memory-intensive and are developed assuming large memory capacities are provided to avoid out-of-core development complexity. To reduce complexity and I/O costs, HPC and Cloud sites have been increasing DRAM capacities. However, while many applications desire an effectively infinite memory to generate and analyze massive datasets, the ever-increasing size of data and the extreme financial and energy costs of DRAM make scaling DRAM capacity unsustainable.

In this work, we expose the Hermes I/O buffering system as a software distributed shared memory (DSM) that enlarges effective memory capacity through intelligent tiered DRAM and storage management. This DSM provides workload-aware data organization, eviction, and prefetching policies to reduce DRAM consumption while ensuring speedy access to critical data. Evaluations show that various workloads can be executed with a fraction of the DRAM while offering competitive performance.

10:40-11:05 AM – DataStates: Scalable Lineage-Driven Data Management In The Age of AI – Bogdan Nicolae, Argonne National Laboratory
Video

Checkpointing is the most widely used approach to provide resilience for HPC applications by enabling restart in case of failures. However, coupled with a searchable lineage that records the evolution of intermediate data and metadata during runtime, it can become a powerful technique in a wide range of scenarios at scale: verify and understand the results more thoroughly by sharing and analyzing intermediate results (which facilitates provenance, reproducibility, and explainability), new algorithms and ideas that reuse and revisit intermediate and historical data frequently (either fully or partially), manipulation of the application states (job pre-emption using suspend-resume, debugging), etc. This talk advocates a new data model and associated tools (DataStates, VELOC) that facilitate such scenarios. In the age of AI, this approach has particularly interesting applications: repositories for derived models that keep provenance and enable incremental storage (e.g. in the context of NAS), data pipelines with historic access (e.g. for continual learning based on rehearsal), evaluation of intermediate training stages and surviving model spikes (especially for LLMs), etc.

11:05-11:30 AM – Update on Cloud Optimized HDF5 Files – Dr. Aleksandar Jelenak, The HDF Group
Slide Deck | Video

Ever since introducing the concept of cloud optimized HDF5 files (COH5) users were interested about specific storage settings for these files and their impact on performance. This talk will provide some answers to such questions based on the work with NASA stakeholders and their satellite HDF5 data.

11:30-11:55 AM – HSDS Multi API – Matt Larson, The HDF Group
Slide Deck | Video

The Highly Scalable Data Service (HSDS) allows the HDF5 data model to be used in the cloud through an HTTP REST interface, facilitating seamless interaction with cloud-based object storage. However, cloud storage access patterns optimized for local machines can become costly and inefficient. This talk will introduce the new Multi-Interface for HSDS, showcasing how it can achieve significant performance improvements. Additionally, this talk will provide an introduction to key components of the Python ecosystem for HDF5: h5py, HSDS, and h5pyd.

11:55-1:00 PM – LUNCH

1:00-1:25 PM – Potential revision to MPI-IO consistency and its impact on HDF5 – Chen Wang, Lawrence Livermore National Laboratory
Slide Deck | Video

The first Message Passing Interface (MPI) standard (MPI 1.0) was published in 1994, aiming to create a widely used standard for writing message-passing programs. However, I/O was not included in the initial MPI standard; support for parallel I/O was added later in MPI-2.0, which was published in 1997. Since then, I/O has become an integral part of the MPI standard. The design of MPI-IO has remained largely unchanged over the years, and we have observed some performance limitations in HPC systems.

MPI-IO serves as middleware that sits beneath the parallel HDF5 implementation. Therefore, HDF5’s performance is significantly influenced by the underlying MPI-IO implementation. In this talk, I will discuss the performance implications of the current MPI-IO design, with a special focus on its interface and consistency semantics. Additionally, I will discuss an ongoing effort within the MPI-IO workgroup to revise the current MPI-IO consistency design, and how these potential revisions could affect HDF5, considering both programmability and performance.

1:25-1:30 PM – Efficient HDF5 Data Access for Exa-scale Scientific Application – Houjun Tang, Berkeley Lab (Lightning Session)
Slide Deck | Video

New exascale-scale supercomputers advance scientists’ ability to perform large-scale simulations. The regional earthquake simulations in the DOE-funded EQSIM project can now create high-resolution, spatially dense ground motions that provide new insights into understanding and quantifying seismic risk. However, accessing and working with these massive datasets, which can be over hundreds of terabytes, can be a significant challenge. In this talk, I will discuss our recent developments for practical data access and visualization of large HDF5 datasets.

1:30-2:30 PM – StoreHub: Building a Community for the Future of Data Storage Research – Anthony Kougkas, Illinois Tech
Slide Deck | Survey | Video

The evolution of data management and storage technologies necessitates a collaborative effort from researchers, developers, and industry partners to address emerging challenges and shape the future of data-intensive applications. This BoF session will introduce StoreHub, an NSF-funded cyberinfrastructure planning effort dedicated to advancing data storage research. The session will engage the HPC community to discuss the future of data storage research, gather input on StoreHub’s deployment, and explore potential collaborations to shape the cyberinfrastructure to be deployed. Participants will gain early insights into this initiative and have the opportunity to influence its direction, fostering data storage innovations and collaborations. Please make sure to do the survey for this project: Community Insight Survey for Data Storage Research

2-30-2:55 PM– HDF5 Subfiling: A Scalable Approach to Exascale I/O – M. Scot Breitenfeld, The HDF Group
Slide Deck | Video

The increasing size and complexity of scientific datasets present significant challenges for data management at exascale. This presentation examines the potential of HDF5’s subfiling feature, a key component of HDF5, to address these challenges. We will explore the concept of subfiling, explaining how it enables efficient access at exascale.

The presentation will also demonstrate real-world use cases where subfiling can significantly enhance I/O performance. We will discuss how subfiling can optimize data access patterns for exascale applications, resulting in faster read and write operations and decreased overall processing time. Additionally, we will analyze the performance benefits of subfiling compared to traditional HDF5 access methods.

2:55-3:15 PM – Poster Sessions: 5 minute talks from the following: (20 minutes)

Optimizing Workflow Performance by Elucidating Semantic Data Flow – Meng Tang, Illinois Institute of Technology
Slide Deck and Poster | Video

The combination of ever-growing scientific datasets and distributed workflow complexity creates I/O performance bottlenecks due to data volume, velocity, and variety. Although the increasing use of descriptive data formats (e.g., HDF5) helps organize these datasets, it also creates obscure bottlenecks due to the need to translate high-level operations into file addresses and then into low-level I/O operations. To address this challenge, we propose using Semantic Dataflow Graphs to analyze (a) relationships between logical datasets and file addresses, (b) how dataset operations translate into I/O, and (c) the combination across entire workflows. Our analysis and visualization enable the identification of performance bottlenecks and reasoning about performance improvement in workflows.

Enlarging Effective DRAM Capacity through Hermes – Luke Logan, Gnosis Research Center

Luke has a session on his poster’s content at 10:15 on Tuesday, so he will not repeat a talk in this time slot. However, Luke will be at his poster during the break for additional questions.

DTIO: Unifying I/O for HPC and AI – Keith Bateman, Illinois Institute of Technology
Slide Deck and Poster | Video

HPC, Big Data Analytics, and Machine Learning have become increasingly intertwined, as popular models such as LLMs and Diffusion Models have driven discovery in fields such as molecular simulation and cosmology. Applications like GenSLMs and OpenFold have proven the value of ML in accelerating scientific applications. However, The convergence of these fields is incomplete, as each has its own storage infrastructure with unique I/O interfaces and storage systems. For HPC, the typical storage infrastructure involves a Parallel File System and HDF5, MPI-IO, or POSIX, while ML workloads such as RAG may utilize a distributed vector database. Their application domains have different I/O needs, with HPC typically utilizing write-intensive bulk operations while ML has read-intensive small operations. There is a need for a system which unifies the existing I/O stack for the convergence of HPC and ML. For this purpose, we propose DTIO, a DataTask I/O Library. DTIO will preserve the semantics of ML and HPC storage stacks while providing transparent data placement for a given data object. It will provide delayed consistency, which achieves better performance and decision-making by performing I/O asynchronously during compute phases. It will replicate tasks across various storage to serve the purposes of converged workflows. Finally, it will utilize I/O interception and translation to the DataTask abstraction in order to accomplish these objectives.

I/O model based on HDF5 – Hua Xu, Gnosis Research Center (IIT)
Slide Deck | Video

As computer applications become more data-intensive, their demands on storage systems for efficient storage and retrieval have significantly increased. Compute resources on clusters are often used exclusively by users for maximizing performance, but storage resources are shared across multiple users for better utilization. In such environments where resources are shared by workloads, an application’s IO performance can vary significantly due to interference from other jobs. A related problem is that of scheduling user jobs on a cluster to maximize resource utilization and minimize total execution time. A data acquisition system(DAC) deployed on clusters is a useful tool that can be used by job schedulers to make informed scheduling decisions. In this work we propose a DAC with predictive models that can learn the I/O workloads on clusters and provide predictions on system performance.

Modeling the performance of the storage layer on clusters is challenging due to the presence of multiple interacting software, sophisticated hardware, variable file types and layouts on disks and variable IO traffic from users. User-observed IO performance depends on the IO library and its usage of the file system. The IO library’s metadata APIs and the available parallelism in the file system affect the parallel IO performance. The file layout on disks (stripe count and stripe size) is another significant factor that affects load balance and parallelism in the storage layer. The impact of interference from other users is hard to model accurately. This interference is one of the reasons why empirical models of IO performance and storage systems have not been successful for modern HPC systems.

We propose a supervised learning based I/O model that updates itself with feedback from the cluster. This IO model will predict the IO time (read/write) per process for a given file layout, average IO request size (number of bytes), number of concurrent readers/writers, IO servers and storage disks. The learning framework will consist of a trained base performance model which will be continually updated as new data arrives. Updates will be incorporated in to the base model by minimizing the influence of outliers to provide accurate predictions in the presence of interference. The predicted IO performance is an indicator of the current load on the storage servers. It can be used by the job scheduling algorithm to minimize the total IO time of a set of IO jobs on the cluster.

This work will be carried out on the Ares cluster, which consists of one rack of compute nodes. All nodes share a 48TB RAID-5 storage pool comprised of eight 8TB 7200 RPM SAS hard drives. Nodes within each rack are connected with 40Gbps Ethernet with RoCE support. The model will be built and analyzed for the HDF5 file format, with ROMIO extensions for MPI-IO and PVFS2 (Parallel Virtual File System). Key parameters in HDF5 and PVFS, such as the number of processes, servers, and clients in PVFS, and stripe size, are considered as significant parameters for the model.

HDF5 to drive research and new target discovery in immuno-oncology – Vinay Vyas, Arcus Biosciences
Video

Leveraging HDF for managing and storing the gene sets analyzed at Arcus Biosciences to pursue a cure for cancer. HDF based design helps adhere to FAIR – Findable, Interoperable, Accessible and Reusable principles. The data storage model can help scientists retrieve comprehensive information about genes of interest by capturing its membership in various pathways, co-expression profile, it’s gene families, location in a genome etc. Having an internal database with context driven concepts with predetermined offsets could help retrieve complex data sets quickly and significantly improve productivity of scientists in finding novel targets for cancer research.

3:15-3:50 PM – POSTER BREAK

Have snacks and browse the posters. Presenters will be standing by for your questions.

3:50-4:15 PM – HDF5 infrastructure in DUNE – Barnali Chowdhury, Argonne National Laboratory
Video

The Deep Underground Neutrino Experiment (DUNE), is expected to begin operations in the late 2020s. DUNE will make discoveries of the unknown neutrino mass ordering and precise measurement of neutrino mixing parameter of CP violation. In a single experiment, DUNE will enable a broad exploration of the three-flavor model of neutrino physics with unprecedented detail. DUNE will consist of a modular far detector (FD) located at SURF in South Dakota, USA, and a near detector (ND) located on site at Fermilab in Illinois. Recently, DUNE has begun using the Hierarchical Data Format (HDF5) to record raw data from two prototypes, ProtoDUNE Horizontal Drift (HD) and Vertical Drift (VD), for its data storage applications. Dedicated I/O modules are developed to read the HDF5 data from these detectors into the offline framework for reconstruction, directly and via XRootD. DUNE’s large-scale offline data uses streaming via XrootD to take advantage of CPU at sites without large data stores. The event reconstruction software of DUNE Near detector (ND) relies heavily on AI/ML applications and High Performance Computers (HPCs). Choosing HDF5 makes it well-suited for the AI/ML workflow in HPC. In collaboration with HEP-CCE, DUNE is exploring different HDF5 formats optimized for GPU and developing strategies to incorporate HDF5 in the detector simulation chains of DUNE far detector.

4:15-4:30 PM – US-RSE: Empowering Hidden Contributors Driving Science – Sandra Gesing, Executive Director, US Research Software Engineer Association (US-RSE)
Slide Deck | Video

Over the past decade, academia and national labs have increasingly recognized the crucial role of hidden contributors contributing to accelerating science. The acknowledgement is evident in quite some projects. From the founding of 8 Research Software Engineer Associations worldwide to the dedicated efforts of the NSF Center of Excellence for Science Gateways. While it is encouraging that the importance of research software and projects such as HDF5 by the HDFGroup, we still have a long road in front of us for well-defined career paths and incentives for people in the line of work of creating research software. A multi-faceted approach is needed to meet researchers and educators as well as the hidden contributors where they are. This talk will delve into the crucial role of research software engineers in advancing research and computational activities. Furthermore, It will highlight the importance of fostering a community that encompasses all stakeholders in academia and national labs, advocating for a cultural change and actionable measures on how everyone can contribute to make it happen.

4:30-4:40 PM – Student Introductions – Lori Cooper, facilitator

Students and instructors will come up for introductions. You’ll meet each student, learn where they’re studying, when they are graduating, and their research interests. Keep these names and faces in mind when we’re at dinner later and use that opportunity to talk to the future of research and industry!

6:00 PM – Banquet at Greek Islands, 200 S Halsted Chicago, IL. 60661

Menu for our Banquet Dinner

Dinner is included for all registrants! If you have a +1, please email hug@hdfgroup.org. We should be able to include them for just the cost of the meal.

Wednesday, August 7, 2024

8:00-9:00 AM – BREAKFAST

9:00-9:45 AM – Best Practices for HDF5 – The HDF Group Staff and contributors

9:45-10:20 AM – HDF5 Performance Tuning – M. Scot Breitenfeld, The HDF Group
Slide Deck | Video

10:20-11:00 AM – Crashproofing – Neil Fortner, The HDF Group
Slide Deck | Video

11:00-12:00 PM – Cloud Ready HDF5 – Matt Larson, John Readey, and Aleksandar Jelenak, The HDF Group
Slide Deck | Video

The HDF Group understands the importance of enabling HDF in the cloud. In the last decade, we have added the ROS3 VFD for read-only access to HDF5 files on S3, defined a cloud-optimized HDF5 file format (CoHDF5), and developed HSDS (a REST-based service for HDF data). But what are the most important things to tackle going forward? In this talk we discuss five possible projects that would advance the capabilities of HDF in the cloud. All are in the early ideation stage. Does the community feel that these are the right priorities? Come listen, and let us know what you think.

12:00-1:00 PM – LUNCH