Acta Aeronautica et Astronautica Sinica > 2017, Vol. 38 Issue (1): 320107   doi: 10.7527/S1000-6893.2016.0126

High-throughput GPU-based LDPC decoder architecture for space communication

HOU Yi, LIU Rongke, PENG Hao, ZHAO Ling, XIONG Qingxu

  1. School of Electronics and Information Engineering, Beihang University, Beijing 100083, China
  • Received: 2016-01-25 Revised: 2016-04-25 Online: 2017-01-15 Published: 2016-05-05
  • Corresponding author: LIU Rongke, E-mail: rongke_liu@buaa.edu.cn
  • Supported by:

    National Natural Science Foundation of China (91438116)

Abstract:

To meet the demand of space communication for high-speed, reconfigurable channel decoders, a high-throughput software decoding architecture for low-density parity-check (LDPC) codes is proposed that exploits the parallel computing capability of graphics processing units (GPUs). The execution efficiency of the decoding kernel functions is improved by optimizing the intra-block and inter-block thread parallelism of the node update operations in the turbo-decoding message passing (TDMP) algorithm, reducing the thread branching caused by irregular row weights, lowering the latency of thread accesses to the memory that holds the node update information, and appropriately quantizing the information stored by the decoder. On this basis, the asynchronous compute unified device architecture (CUDA) stream processing mechanism is introduced: the execution scheduling between the decoder's input/output data transfers and the kernel functions, and the allocation of decoding thread resources across CUDA streams, are designed to maximize decoding throughput while reducing decoding latency. TDMP decoding experiments on the Consultative Committee for Space Data Systems (CCSDS) telemetry standard LDPC codes, run on Nvidia's latest Tesla K20 and GTX980 platforms, show that with 10 decoding iterations the proposed architecture reaches a maximum throughput of about 500 Mbps with an average decoding latency of about 2 ms. Compared with existing results, the proposed architecture balances decoding throughput and latency more effectively while retaining the configuration flexibility of a software architecture.

Key words: low-density parity-check codes, graphics processing units, software decoding architecture, turbo-decoding message passing algorithm, high throughput, low latency
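
For readers unfamiliar with TDMP (layered) decoding, the node update that the abstract refers to can be sketched on the CPU as follows. This is a minimal illustration and not the authors' GPU implementation: it assumes a min-sum check-node rule, one parity-check row per layer, and saturation of all messages to a signed 8-bit range as a stand-in for the quantization step. The names saturate, tdmp_decode, and TOY_ROWS, and the Hamming(7,4) toy code, are hypothetical and chosen only for illustration; they are not CCSDS codes or parameters from the paper.

```python
def saturate(value, limit=127):
    """Clip a message to a signed 8-bit style range (assumed quantization)."""
    return max(-limit, min(limit, value))


def tdmp_decode(rows, channel_llr, max_iters=10):
    """Layered (TDMP) min-sum decoding, one check row per layer.

    rows[m] lists the variable indices checked by row m (at least two each).
    channel_llr holds integer LLRs; a positive value favours bit 0.
    Returns (hard_bits, iterations_used).
    """
    posterior = [saturate(v) for v in channel_llr]
    # check_msg[m][k]: message from check m to the k-th variable of its row.
    check_msg = [[0] * len(row) for row in rows]
    bits = [1 if v < 0 else 0 for v in posterior]

    for iteration in range(1, max_iters + 1):
        for m, row in enumerate(rows):
            # Remove this check's previous contribution from the posteriors.
            extrinsic = [posterior[n] - check_msg[m][k] for k, n in enumerate(row)]
            mags = [abs(q) for q in extrinsic]
            # The two smallest magnitudes yield every "minimum of the others".
            order = sorted(range(len(row)), key=mags.__getitem__)
            min1_pos, min1, min2 = order[0], mags[order[0]], mags[order[1]]
            # Product of all signs; multiplying by a variable's own sign
            # gives the sign product over the other variables.
            parity_sign = 1
            for q in extrinsic:
                if q < 0:
                    parity_sign = -parity_sign
            for k, n in enumerate(row):
                mag = min2 if k == min1_pos else min1
                own_sign = -1 if extrinsic[k] < 0 else 1
                check_msg[m][k] = saturate(parity_sign * own_sign * mag)
                # Posteriors are refreshed immediately, so later layers in
                # the same iteration already see the updated values.
                posterior[n] = saturate(extrinsic[k] + check_msg[m][k])
        bits = [1 if v < 0 else 0 for v in posterior]
        if all(sum(bits[n] for n in row) % 2 == 0 for row in rows):
            return bits, iteration  # all parity checks satisfied
    return bits, max_iters


# Toy usage: Hamming(7,4) parity checks, all-zero codeword sent,
# bit 1 received with a weak LLR of the wrong sign.
TOY_ROWS = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 1, 3, 6]]
noisy_llr = [20, -5, 20, 20, 20, 20, 20]
decoded_bits, iterations_used = tdmp_decode(TOY_ROWS, noisy_llr)
```

In a GPU version, the per-row work inside each layer is the natural unit to map onto threads, and the sign and minimum computations are written without data-dependent branches where possible. How the paper actually partitions this work across thread blocks and CUDA streams is not specified in the abstract.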

CLC number: