Acta Aeronautica et Astronautica Sinica, 2024, Vol. 45, Issue 7: 128888-128888   doi: 10.7527/S1000-6893.2023.28888

Shared-memory parallelization technology of unstructured CFD solver for multi-core CPU/many-core GPU architecture

Jian ZHANG1,2, Ruitian LI2, Liang DENG2, Zhe DAI2, Jie LIU1, Chuanfu XU1

  1. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
    2. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang 621000, China
  • Received: 2023-04-19  Revised: 2023-05-14  Accepted: 2023-05-26  Online: 2024-04-15  Published: 2023-05-29
  • Contact: Liang DENG  E-mail: dengliang11@nudt.edu.cn
  • Supported by:
    National Numerical Wind Tunnel (NNW) Project of China; Sichuan Science and Technology Program (2023YFG0152)

Abstract:

Intra-node shared-memory parallelization of unstructured CFD on modern high-performance computer architectures is key to improving floating-point efficiency and enabling larger-scale fluid simulation applications. However, because unstructured-grid CFD computations involve complex topological relationships, poor data locality, and data write conflicts, parallelizing traditional algorithms in shared memory so that they efficiently exploit the hardware capabilities of multi-core CPUs and many-core GPUs remains a significant challenge. Starting from an industrial-grade unstructured CFD solver, a variety of shared-memory parallel algorithms are designed and implemented based on an in-depth analysis of its computational behavior and memory access patterns, and data locality optimizations such as grid reordering, loop fusion, and multi-level memory access are applied to further improve performance. Specifically, a systematic comparison of two parallel modes, loop-level and task-level, is conducted for multi-core CPU architectures, and a novel reduction parallel strategy based on multi-level memory access optimization is proposed for many-core GPU architectures. All of the parallel algorithms and optimization techniques are validated and evaluated using the M6 wing and CHN-T1 aircraft test cases. The results show that on the multi-core CPU platform, the task-level strategy based on partitioning with replication performs best, and Cuthill-McKee grid reordering and loop fusion each improve overall performance by 10%. On the many-core GPU platform, the reduction strategy based on multi-level memory access yields significant acceleration: the hot-spot functions subject to data races run 3 times faster than before optimization, and the overall speed-up relative to serial CPU execution reaches 127.
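To make the write-conflict issue and the loop-level parallel mode concrete, the following is a minimal sketch, not the solver's actual code: a hypothetical face-based residual accumulation parallelized with OpenMP, in which the scatter into the two cells adjacent to each face is guarded with atomics. The Face structure, array names, and data layout are assumptions introduced purely for illustration.

#include <omp.h>
#include <vector>

// Hypothetical layout: each interior face stores the indices of its two cells.
struct Face { int left, right; };

// Loop-level parallelism over faces. Different threads may update the same
// cell's residual, so the scatter is protected with OpenMP atomics.
void accumulateResiduals(const std::vector<Face>& faces,
                         const std::vector<double>& faceFlux,  // one flux per face
                         std::vector<double>& residual)        // one residual per cell
{
    #pragma omp parallel for schedule(static)
    for (long f = 0; f < static_cast<long>(faces.size()); ++f) {
        const double phi = faceFlux[f];
        #pragma omp atomic
        residual[faces[f].left]  += phi;   // contribution to the left cell
        #pragma omp atomic
        residual[faces[f].right] -= phi;   // equal and opposite contribution
    }
}

The task-level partition-and-replication mode reported above as the fastest CPU variant broadly avoids such atomics by giving each thread its own cell partition plus replicated storage for the contributions of partition-boundary faces, which are merged in a separate pass; only the simpler atomic-guarded loop-level variant is sketched here.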

Key words: unstructured grid, CFD, shared-memory parallelization, GPU, memory access optimization
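As a companion illustration of the reduction strategy based on multi-level memory access for GPUs, the sketch below shows the general register -> shared memory -> global memory reduction pattern in CUDA, here applied to a residual-norm accumulation. It is a sketch of the pattern only, not code from the solver; all identifiers, the use case, and the launch configuration are hypothetical. Double-precision atomicAdd requires compute capability 6.0 or newer (e.g. compile with nvcc -arch=sm_60).

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Level 1: per-thread accumulation in registers; Level 2: block-wide tree
// reduction in shared memory; Level 3: one atomic update of global memory per block.
__global__ void residualNormSquared(const double* res, int nCells, double* norm2)
{
    extern __shared__ double sdata[];            // per-block staging buffer
    const int tid = threadIdx.x;

    double local = 0.0;                          // register-level accumulation
    for (int i = blockIdx.x * blockDim.x + tid; i < nCells;
         i += gridDim.x * blockDim.x)
        local += res[i] * res[i];

    sdata[tid] = local;                          // shared-memory tree reduction
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(norm2, sdata[0]);    // one global atomic per block
}

int main()
{
    const int nCells = 1 << 20;
    std::vector<double> h_res(nCells, 1.0);      // dummy residual field

    double *d_res, *d_norm2;
    cudaMalloc(&d_res, nCells * sizeof(double));
    cudaMalloc(&d_norm2, sizeof(double));
    cudaMemcpy(d_res, h_res.data(), nCells * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_norm2, 0, sizeof(double));

    const int block = 256;                       // power of two for the tree reduction
    const int grid  = 1024;
    residualNormSquared<<<grid, block, block * sizeof(double)>>>(d_res, nCells, d_norm2);

    double h_norm2 = 0.0;
    cudaMemcpy(&h_norm2, d_norm2, sizeof(double), cudaMemcpyDeviceToHost);
    std::printf("||R||^2 = %.1f (expected %d)\n", h_norm2, nCells);

    cudaFree(d_res);
    cudaFree(d_norm2);
    return 0;
}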

CLC number: