
Acta Aeronautica et Astronautica Sinica ›› 2024, Vol. 45 ›› Issue (7): 128888-128888.doi: 10.7527/S1000-6893.2023.28888

• Fluid Mechanics and Flight Mechanics •

Shared-memory parallelization technology of unstructured CFD solver for multi-core CPU/many-core GPU architecture

Jian ZHANG1,2, Ruitian LI2, Liang DENG2(), Zhe DAI2, Jie LIU1, Chuanfu XU1   

  1. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
    2. Computational Aerodynamic Institute, China Aerodynamic Research and Development Center, Mianyang 621000, China
  • Received:2023-04-19 Revised:2023-05-14 Accepted:2023-05-26 Online:2024-04-15 Published:2023-05-29
  • Contact: Liang DENG E-mail:dengliang11@nudt.edu.cn
  • Supported by:
    National Numerical Wind-tunnel (NNW) Project of China; Sichuan Science and Technology Program (2023YFG0152)

Abstract:

Shared-memory parallelization of unstructured CFD on modern high-performance computer architectures is key to improving floating-point efficiency and realizing large-scale fluid simulation capabilities. However, due to complex topological relationships, poor data locality, and data write conflicts in unstructured CFD computing, parallelizing traditional algorithms in shared memory to efficiently exploit the hardware capabilities of multi-core CPUs/many-core GPUs has become a significant challenge. Starting from industrial-level unstructured CFD software, a variety of shared-memory parallel algorithms are designed and implemented based on a deep analysis of its computing behavior and memory access patterns, and data locality optimization techniques such as grid reordering, loop fusion, and multi-level memory access are used to further improve performance. Specifically, a comprehensive study is conducted on two parallel modes, loop-based and task-based, for multi-core CPU architectures, and an innovative reduction parallel strategy based on a multi-level memory access optimization method is proposed for the many-core GPU architecture. All the parallel methods and optimization techniques implemented are analyzed in depth and evaluated with test cases of the M6 wing and the CHN-T1 airplane. The results show that the division-and-replication parallel strategy performs best on the CPU platform, and that Cuthill-McKee grid renumbering and loop fusion can each improve performance by a further 10%. For the GPU platform, the proposed reduction strategy combined with multi-level memory access optimization yields a significant acceleration: for hot-spot subroutines with data races, the speed-up is further improved by a factor of 3, and the overall speed-up reaches 127.
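The data-write conflict mentioned above arises because, in a face-based unstructured solver, each face scatters its flux into the residuals of two shared cells, so two threads processing different faces may update the same cell concurrently. The division-and-replication strategy avoids this by giving each thread a private copy of the residual array and reducing the copies afterwards. The following is a minimal CPU-side sketch of that idea using `std::thread`; all names (`accumulate_residuals`, `face_left`, `face_flux`, etc.) are hypothetical and not taken from the paper's software.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical face-based flux accumulation on an unstructured grid.
// Each face adds +flux to its left cell and -flux to its right cell, so a
// naive parallel face loop would race on the shared residual array. The
// replication strategy gives every thread a private residual array and
// merges (reduces) the copies after the face loop finishes.
void accumulate_residuals(const std::vector<int>& face_left,
                          const std::vector<int>& face_right,
                          const std::vector<double>& face_flux,
                          std::vector<double>& residual,
                          unsigned num_threads) {
    const std::size_t num_faces = face_flux.size();
    const std::size_t num_cells = residual.size();

    // One private copy of the residual array per thread (the replication step).
    std::vector<std::vector<double>> local(
        num_threads, std::vector<double>(num_cells, 0.0));

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Static division of the face loop among the threads.
            for (std::size_t f = t; f < num_faces; f += num_threads) {
                local[t][face_left[f]]  += face_flux[f];   // no race: private copy
                local[t][face_right[f]] -= face_flux[f];
            }
        });
    }
    for (auto& w : workers) w.join();

    // Reduction step: sum the private copies into the shared residual array.
    for (unsigned t = 0; t < num_threads; ++t)
        for (std::size_t c = 0; c < num_cells; ++c)
            residual[c] += local[t][c];
}
```

The trade-off is memory: replication costs one residual array per thread, which is affordable on a multi-core CPU but motivates the different, multi-level-memory reduction strategy on the GPU, where thousands of threads are in flight.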
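The Cuthill-McKee renumbering credited above with part of the CPU-side gain improves data locality: it relabels cells with a breadth-first traversal that visits neighbors in order of increasing degree, so cells sharing a face end up with nearby indices and the face loop touches nearby memory. A minimal sketch of the classic algorithm on a cell-adjacency list follows; the function and variable names are illustrative, not the paper's.

```cpp
#include <algorithm>
#include <queue>
#include <vector>

// Cuthill-McKee renumbering of an undirected cell-adjacency graph.
// Returns order, where order[i] is the old index of the cell that
// receives new index i. Handles disconnected components.
std::vector<int> cuthill_mckee(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> order;
    std::vector<bool> visited(n, false);

    auto degree = [&](int v) { return static_cast<int>(adj[v].size()); };

    while (static_cast<int>(order.size()) < n) {
        // Seed each component with an unvisited cell of minimum degree.
        int seed = -1;
        for (int v = 0; v < n; ++v)
            if (!visited[v] && (seed < 0 || degree(v) < degree(seed)))
                seed = v;

        std::queue<int> q;
        q.push(seed);
        visited[seed] = true;
        while (!q.empty()) {
            int v = q.front();
            q.pop();
            order.push_back(v);
            // Enqueue unvisited neighbors in order of increasing degree.
            std::vector<int> nbrs;
            for (int w : adj[v])
                if (!visited[w]) nbrs.push_back(w);
            std::sort(nbrs.begin(), nbrs.end(),
                      [&](int a, int b) { return degree(a) < degree(b); });
            for (int w : nbrs) {
                visited[w] = true;
                q.push(w);
            }
        }
    }
    return order;
}
```

Applying the permutation to cell data and face endpoint indices before the solver runs is a one-time preprocessing cost that pays off on every subsequent face-loop sweep.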

Key words: unstructured-grid, CFD, shared memory parallelization, GPU, memory access optimization
