HipBone, GPU-enabled asynchronous tasks, auto-tuning, and more.

In this regular feature, HPCwire highlights recently published research in the high performance computing community and related fields. From parallel programming to exascale to quantum computing, the details are here.

A two-MPI process mesh arrangement of third-order 2D spectral elements. Credit: Chalmers et al.

HipBone: A high-performance portable GPU-accelerated C++ version of the NekBone benchmark

Using three HPC systems from Oak Ridge National Laboratory – the Summit supercomputer and the Frontier early-access clusters Spock and Crusher – the academic and industry research team (which includes two AMD authors) demonstrated the performance of hipBone, an open-source proxy application for the Nek5000 computational fluid dynamics code. HipBone “is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several new algorithmic and implementation improvements that optimize its performance on modern fine-grained parallel GPU accelerators.” Tests demonstrate hipBone’s “portability across different clusters and very good scaling efficiency, especially on large problems.”
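The heart of a NekBone/hipBone-style workload is a spectral-element operator applied as dense tensor contractions on each small element. The sketch below, in NumPy, shows that kind of kernel in minimal form; the function names are invented, the geometric factors are simplified to a uniform weight, and this is not hipBone’s actual implementation.

```python
import numpy as np

# Illustrative sketch (not hipBone's code): a NekBone-style kernel applies a
# 1D derivative matrix D along each dimension of an N x N x N spectral
# element via dense tensor contractions.
def apply_derivatives(D, u):
    """Return the three directional derivatives of the local field u."""
    ur = np.einsum('ia,ajk->ijk', D, u)   # derivative along first axis
    us = np.einsum('ja,iak->ijk', D, u)   # derivative along second axis
    ut = np.einsum('ka,ija->ijk', D, u)   # derivative along third axis
    return ur, us, ut

def local_laplacian(D, G, u):
    """Weak-form Laplacian on one element: A u = D^T (G (D u)).
    G holds geometric factors; taken elementwise (diagonal) here for brevity."""
    ur, us, ut = apply_derivatives(D, u)
    wr, ws, wt = G * ur, G * us, G * ut
    return (np.einsum('ai,ajk->ijk', D, wr)     # transposed contractions
            + np.einsum('aj,iak->ijk', D, ws)
            + np.einsum('ak,ija->ijk', D, wt))

N = 4  # points per dimension for third-order elements
rng = np.random.default_rng(0)
D = rng.standard_normal((N, N))
G = np.ones((N, N, N))
u = rng.standard_normal((N, N, N))
Au = local_laplacian(D, G, u)
print(Au.shape)  # (4, 4, 4)
```

Because the operator has the form Dᵀ G D with a diagonal G, it is symmetric, which is one easy sanity check on a kernel like this.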

Authors: Noel Chalmers, Abhishek Mishra, Damon McDougall and Tim Warburton

A case for intra-rack resource disaggregation in HPC

A multi-agency research team used Cori, a high-performance computing system at the National Energy Research Scientific Computing Center, to analyze “resource disaggregation to enable finer-grained allocation of hardware resources to applications.” In their paper, the authors also describe a “variety of deep learning applications to represent an emerging workload.” The researchers demonstrated that “for a rack configuration and applications similar to Cori, a central processing unit with intra-rack disaggregation has a 99.5% probability of finding all the resources it needs inside its rack.”
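The question the paper quantifies can be sketched as a toy Monte Carlo experiment: given pooled resources inside one rack, how often can a randomly drawn job satisfy all its demands locally? The pool sizes and demand distributions below are invented for illustration and bear no relation to Cori’s actual configuration or the paper’s measured 99.5% figure.

```python
import random

# Toy Monte Carlo sketch (invented numbers, not Cori's): with intra-rack
# disaggregation a job may draw from the rack-wide pools rather than from
# one fixed node's resources.
def fits_in_rack(demand, pool):
    return all(demand[r] <= pool[r] for r in demand)

def estimate_fit_probability(trials=10_000, seed=1):
    rng = random.Random(seed)
    pool = {"cores": 256, "mem_gb": 1024, "gpus": 16}  # rack-wide pools
    hits = 0
    for _ in range(trials):
        demand = {
            "cores": rng.randint(1, 128),
            "mem_gb": rng.randint(4, 1536),  # occasionally exceeds the pool
            "gpus": rng.randint(0, 8),
        }
        hits += fits_in_rack(demand, pool)
    return hits / trials

p = estimate_fit_probability()
print(f"fraction of jobs satisfiable inside the rack: {p:.3f}")
```

In this toy setup only the memory demand can exceed its pool, so the estimate lands near the analytic value of roughly 0.67; the paper’s contribution is doing this analysis with real workload traces rather than synthetic draws.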

Authors: George Michelogiannakis, Benjamin Klenk, Brandon Cook, Min Yee Teh, Madeleine Glick, Larry Dennison, Keren Bergman and John Shalf

Jacobi 3D MPI example (Jacobi3D) with a manual overlap option. Credit: Choi et al.

Improved scalability with GPU-enabled asynchronous tasks

Computer scientists from the University of Illinois at Urbana-Champaign and Lawrence Livermore National Laboratory have demonstrated improved scalability by hiding communication behind computation with GPU-enabled asynchronous tasks. According to the authors, “while the ability to hide communication behind computation can be very effective in weak scaling scenarios, performance begins to suffer at smaller problem sizes or at strong scale, due to fine-grained overheads and less room for overlap.” The authors integrated “GPU-enabled communication into asynchronous tasks, in addition to compute-communication overlap, with the goal of reducing the time spent in communication and further increasing GPU utilization.” They demonstrated the performance impact of their approach using “a proxy application that runs the iterative Jacobi method on GPUs, Jacobi3D.” In their paper, the authors also examine “techniques such as kernel fusion and CUDA Graphs to combat fine-grained overheads at large scale.”
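The compute-communication overlap the paper builds on can be illustrated with a toy Jacobi-style step. Plain Python threads and sleeps stand in here for GPU streams, kernels, and message transfers; the structure and timings are illustrative only, not the authors’ Charm++/CUDA Jacobi3D code.

```python
import threading
import time

# Conceptual sketch of overlap: in a Jacobi-style iteration the interior
# update does not depend on halo data, so it can run while halos transfer.
def exchange_halos(state):
    time.sleep(0.05)               # stand-in for an asynchronous transfer
    state["halos_ready"] = True

def update_interior(state):
    time.sleep(0.05)               # stand-in for the interior stencil kernel
    state["interior_done"] = True

def jacobi_step_overlapped():
    state = {"halos_ready": False, "interior_done": False}
    comm = threading.Thread(target=exchange_halos, args=(state,))
    t0 = time.perf_counter()
    comm.start()                   # post communication asynchronously...
    update_interior(state)         # ...and overlap the interior compute
    comm.join()                    # wait for halos before the boundary update
    # the boundary stencil would run here, once both are complete
    return time.perf_counter() - t0, state

elapsed, state = jacobi_step_overlapped()
print(f"overlapped step: {elapsed:.3f}s (vs ~0.100s if serialized)")
```

The overlapped step takes roughly as long as the slower of the two phases rather than their sum, which is exactly the margin that shrinks at strong scale, when both phases become short and fine-grained overheads dominate.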

Authors: Jaemin Choi, David F. Richards, Laxmikant V. Kale

A convolutional neural network-based approach for computational fluid dynamics

To overcome the cost, time and memory drawbacks of using computational fluid dynamics (CFD) simulation, this Indian research team proposed “a model based on convolutional neural networks to predict non-uniform flow in 2D.” They define CFD as “the visualization of how a fluid moves and interacts with things as it passes by, using applied mathematics, physics, and computational software.” The authors’ approach “aims to simulate the behavior of fluid particles in a given system and to aid in the development of the system based on the fluid particles passing through it. In the early stages of design, this technique can provide rapid feedback for real-time design reviews.”
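The building block such surrogate models rest on is the 2D convolution, which maps a geometry description to spatial feature maps. Below is a minimal from-scratch sketch of one convolution applied to an obstacle mask; the kernel is hand-picked and untrained, and none of this reflects the authors’ actual network architecture.

```python
import numpy as np

# Minimal sketch of one CNN layer: a 2D convolution over a geometry mask.
# A trained flow surrogate would stack many such layers with learned kernels.
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Geometry mask: a square obstacle in a small 2D channel.
geometry = np.zeros((8, 8))
geometry[3:5, 3:5] = 1.0

# A hand-picked Laplacian-like kernel that responds to obstacle edges;
# a trained model would learn kernels like this from simulation data.
kernel = np.array([[0., -1., 0.],
                   [-1., 4., -1.],
                   [0., -1., 0.]])

feature_map = conv2d(geometry, kernel)
print(feature_map.shape)  # (6, 6)
```

Once trained on CFD outputs, a stack of such layers produces a full flow-field prediction in one forward pass, which is what makes the rapid design-review feedback the authors describe possible.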

Authors: Satyadhyan Chickerur and P Ashish

A single block of the variational wave function in terms of parameterized quantum circuits. Credit: Rinaldi et al.

Matrix model simulations using quantum computing, deep learning and lattice Monte Carlo

This international research team conducted “the first systematic survey of quantum computing and deep learning approaches to matrix quantum mechanics.” While “Euclidean lattice Monte Carlo simulations are the de facto numerical tool for understanding the spectrum of large matrix models and have been used to test the holographic duality,” the authors write, “they are not tailored to extract dynamical properties or even the quantum wave function of the ground state of matrix models.” The authors compare the deep learning approaches with lattice Monte Carlo simulations and provide baselines. The research relied on RIKEN’s HOKUSAI “BigWaterfall” supercomputer.
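To make the “de facto tool” concrete, here is a toy Euclidean lattice Monte Carlo run for a single harmonic degree of freedom, drastically simpler than a matrix model but using the same Metropolis machinery. The parameters and setup are invented for illustration; note that what it estimates, ⟨x²⟩, is an imaginary-time expectation value, which is why such methods struggle with real-time dynamics.

```python
import math
import random

# Toy Euclidean lattice Monte Carlo (Metropolis) for a harmonic oscillator:
# discretized action S = sum_i [ (x_{i+1}-x_i)^2/(2a) + a*omega^2*x_i^2/2 ].
def metropolis_x2(n_sites=64, a=0.5, omega=1.0, sweeps=4000, seed=7):
    rng = random.Random(seed)
    x = [0.0] * n_sites

    def local_action(i, xi):
        xp, xm = x[(i + 1) % n_sites], x[(i - 1) % n_sites]
        return ((xp - xi) ** 2 + (xi - xm) ** 2) / (2 * a) \
            + a * omega ** 2 * xi ** 2 / 2

    samples = []
    for sweep in range(sweeps):
        for i in range(n_sites):
            old, new = x[i], x[i] + rng.uniform(-1.0, 1.0)
            dS = local_action(i, new) - local_action(i, old)
            if dS < 0 or rng.random() < math.exp(-dS):  # Metropolis accept
                x[i] = new
        if sweep > sweeps // 5:  # discard thermalization sweeps
            samples.append(sum(xi * xi for xi in x) / n_sites)
    return sum(samples) / len(samples)

x2 = metropolis_x2()
print(f"<x^2> = {x2:.3f} (continuum value 1/(2*omega) = 0.5)")
```

The matrix-model case replaces the scalar x with N×N matrices and a far richer action, but the sampling logic is the same.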

Authors: Enrico Rinaldi, Xizhi Han, Mohammad Hassan, Yuan Feng, Franco Nori, Michael McGuigan and Masanori Hanada

GPTuneBand: Multi-task and multi-fidelity autotuning for large-scale high-performance computing applications

A group of researchers from Cornell University and Lawrence Berkeley National Laboratory proposes a multi-task, multi-fidelity autotuning framework, called GPTuneBand, for tuning large-scale high-performance computing applications. GPTuneBand “combines a multi-task Bayesian optimization algorithm with a multi-armed bandit strategy, well suited for tuning expensive HPC applications including numerical libraries, scientific simulation codes and machine learning models, particularly with a very limited tuning budget,” the authors write. Compared to its predecessor, GPTuneBand demonstrated “a maximum speedup of 1.2x, and outperforms the single-task, multi-fidelity tuner BOHB on 72.5% of tasks.”
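The multi-armed bandit half of the idea can be sketched with successive halving: spend small fidelity budgets (e.g. few solver iterations) on many candidate configurations, then promote only the best performers to larger budgets. GPTuneBand pairs a bandit strategy like this with multi-task Bayesian optimization; the toy objective, noise model, and parameters below are illustrative, not the authors’ code.

```python
import random

# Successive-halving sketch: many configs at low fidelity, few at high.
def loss(config, budget):
    # Toy objective: true quality is distance to 0.7, and higher budgets
    # (more iterations) reduce evaluation noise.
    return abs(config - 0.7) + random.gauss(0, 0.05 / budget)

def successive_halving(n_configs=32, min_budget=1, seed=3):
    random.seed(seed)
    configs = [random.random() for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: loss(c, budget))
        configs = scored[: max(1, len(configs) // 2)]  # keep the best half
        budget *= 2                                    # double the fidelity
    return configs[0]

best = successive_halving()
print(f"best configuration: {best:.3f} (optimum at 0.700)")
```

Total budget spent grows only logarithmically with the number of starting configurations, which is what makes the approach attractive under the very limited tuning budgets the authors target.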

Authors: Xinran Zhu, Yang Liu, Pieter Ghysels, David Bindel, Xiaoye S. Li

High-performance computing architecture for processing sampled values in the smart grid

In this Open Access article, a group of researchers from the University of the Basque Country, Spain, presents a high-level interface solution for application designers that addresses the shortcomings of current smart grid technologies. Arguing that FPGAs offer higher performance and reliability than CPUs, the authors present a “solution to accelerate the computation of hundreds of streams, combining a custom-designed silicon intellectual property and an accelerator board based on a next-generation field-programmable gate array.” The researchers leverage FPGAs and Xilinx’s adaptive computing framework.
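For a sense of the workload, merging units in an IEC 61850-style substation publish sampled values of voltage and current at kilohertz rates, and each stream must be reduced to quantities such as RMS magnitude. The back-of-the-envelope sketch below computes RMS over one cycle of a clean waveform; the rates and values are illustrative, and the paper’s point is that an FPGA pipeline can sustain this for hundreds of concurrent streams.

```python
import math

# Sketch of per-stream sampled-value processing (illustrative rates).
SAMPLE_RATE = 4000                       # samples per second per stream
F_NOMINAL = 50                           # Hz nominal grid frequency
SAMPLES_PER_CYCLE = SAMPLE_RATE // F_NOMINAL

def sampled_values(amplitude=230.0 * math.sqrt(2), n=SAMPLES_PER_CYCLE):
    """One cycle of a clean sinusoidal voltage waveform."""
    return [amplitude * math.sin(2 * math.pi * F_NOMINAL * i / SAMPLE_RATE)
            for i in range(n)]

def rms(samples):
    """Root-mean-square magnitude over a window of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

stream = sampled_values()
print(f"RMS over one cycle: {rms(stream):.1f} V")  # 230.0 V
```

On an FPGA the same reduction becomes a fixed multiply-accumulate pipeline replicated per stream, which is where the claimed throughput advantage over a CPU comes from.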

Authors: Le Sun, Leire Muguira, Jaime Jiménez, Armando Astarloa, Jesús Lázaro

Do you know of any research that should be included in next month’s list? If so, email us at [email protected]. We look forward to hearing from you.