Acta Aeronautica et Astronautica Sinica, 2024, Vol. 45, Issue 7: 128888-128888   doi: 10.7527/S1000-6893.2023.28888

Shared-memory parallelization technology of unstructured CFD solver for multi-core CPU/many-core GPU architecture

Jian ZHANG1,2, Ruitian LI2, Liang DENG2, Zhe DAI2, Jie LIU1, Chuanfu XU1

  1. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
    2. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang 621000, China
  • Received: 2023-04-19  Revised: 2023-05-14  Accepted: 2023-05-26  Online: 2024-04-15  Published: 2023-05-29
  • Contact: Liang DENG  E-mail: dengliang11@nudt.edu.cn
  • Supported by:
    National Numerical Wind Tunnel (NNW) Project of China; Sichuan Science and Technology Program (2023YFG0152)

Abstract:

Intra-node shared-memory parallelization of unstructured CFD on modern high-performance computer architectures is key to improving floating-point efficiency and enabling larger-scale fluid simulation applications. However, because unstructured-grid CFD computations involve complex topological relationships, poor data locality, and data write conflicts, parallelizing traditional algorithms in shared memory so that they efficiently exploit the hardware capabilities of multi-core CPUs and many-core GPUs remains a significant challenge. Starting from an industrial-grade unstructured CFD solver, a variety of shared-memory parallel algorithms are designed and implemented based on an in-depth analysis of its computational behavior and memory access patterns, and data locality optimizations such as grid reordering, loop fusion, and multi-level memory access are applied to further improve performance. Specifically, a systematic comparison of two parallel modes, loop-level and task-level, is conducted for multi-core CPU architectures, and a novel reduction parallel strategy based on multi-level memory access optimization is proposed for many-core GPU architectures. All of the parallel algorithms and optimization techniques are validated and evaluated using the M6 wing and CHN-T1 aircraft test cases. The results show that on the multi-core CPU platform, the task-level strategy based on partitioning with replication performs best, and Cuthill-McKee grid reordering and loop fusion each improve overall performance by 10%. On the many-core GPU platform, the reduction strategy based on multi-level memory access yields significant acceleration: the hot-spot functions subject to data races run 3 times faster than before optimization, and the overall speed-up relative to serial CPU execution reaches 127.
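To make the write-conflict issue and the loop-level parallel mode concrete, the following is a minimal sketch, not the solver's actual code: a hypothetical face-based residual accumulation parallelized with OpenMP, in which the scatter into the two cells adjacent to each face is guarded with atomics. The Face structure, array names, and data layout are assumptions introduced purely for illustration.

#include <omp.h>
#include <vector>

// Hypothetical layout: each interior face stores the indices of its two cells.
struct Face { int left, right; };

// Loop-level parallelism over faces. Different threads may update the same
// cell's residual, so the scatter is protected with OpenMP atomics.
void accumulateResiduals(const std::vector<Face>& faces,
                         const std::vector<double>& faceFlux,  // one flux per face
                         std::vector<double>& residual)        // one residual per cell
{
    #pragma omp parallel for schedule(static)
    for (long f = 0; f < static_cast<long>(faces.size()); ++f) {
        const double phi = faceFlux[f];
        #pragma omp atomic
        residual[faces[f].left]  += phi;   // contribution to the left cell
        #pragma omp atomic
        residual[faces[f].right] -= phi;   // equal and opposite contribution
    }
}

The task-level partition-and-replication mode reported above as the fastest CPU variant broadly avoids such atomics by giving each thread its own cell partition plus replicated storage for the contributions of partition-boundary faces, which are merged in a separate pass; only the simpler atomic-guarded loop-level variant is sketched here.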

Key words: unstructured grid, CFD, shared-memory parallelization, GPU, memory access optimization
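As a companion illustration of the reduction strategy based on multi-level memory access for GPUs, the sketch below shows the general register -> shared memory -> global memory reduction pattern in CUDA, here applied to a residual-norm accumulation. It is a sketch of the pattern only, not code from the solver; all identifiers, the use case, and the launch configuration are hypothetical. Double-precision atomicAdd requires compute capability 6.0 or newer (e.g. compile with nvcc -arch=sm_60).

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Level 1: per-thread accumulation in registers; Level 2: block-wide tree
// reduction in shared memory; Level 3: one atomic update of global memory per block.
__global__ void residualNormSquared(const double* res, int nCells, double* norm2)
{
    extern __shared__ double sdata[];            // per-block staging buffer
    const int tid = threadIdx.x;

    double local = 0.0;                          // register-level accumulation
    for (int i = blockIdx.x * blockDim.x + tid; i < nCells;
         i += gridDim.x * blockDim.x)
        local += res[i] * res[i];

    sdata[tid] = local;                          // shared-memory tree reduction
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(norm2, sdata[0]);    // one global atomic per block
}

int main()
{
    const int nCells = 1 << 20;
    std::vector<double> h_res(nCells, 1.0);      // dummy residual field

    double *d_res, *d_norm2;
    cudaMalloc(&d_res, nCells * sizeof(double));
    cudaMalloc(&d_norm2, sizeof(double));
    cudaMemcpy(d_res, h_res.data(), nCells * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_norm2, 0, sizeof(double));

    const int block = 256;                       // power of two for the tree reduction
    const int grid  = 1024;
    residualNormSquared<<<grid, block, block * sizeof(double)>>>(d_res, nCells, d_norm2);

    double h_norm2 = 0.0;
    cudaMemcpy(&h_norm2, d_norm2, sizeof(double), cudaMemcpyDeviceToHost);
    std::printf("||R||^2 = %.1f (expected %d)\n", h_norm2, nCells);

    cudaFree(d_res);
    cudaFree(d_norm2);
    return 0;
}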

CLC number: