Acta Aeronautica et Astronautica Sinica > 2022, Vol. 43, Issue (S1): 727001   doi: 10.7527/S1000-6893.2022.27001

Human-UAV swarm multi-modal intelligent interaction methods

SU Lingfei1, HUA Yongzhao2, DONG Xiwang2, REN Zhang1   

  1. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China;
    2. Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
  • Received: 2022-01-27  Revised: 2022-02-17  Published: 2022-03-04
  • Corresponding author: HUA Yongzhao, E-mail: yongzhuahua@buaa.edu.cn
  • Supported by:
    Defense Industrial Technology Development Program (JCKY2019601C106)

Abstract: To address the problem of human-UAV swarm interactive collaborative perception, an interaction framework for collaborative control of swarm formations is constructed using deep learning, built on dual-model autonomous recognition of speech and gestures, and a channel fusion mechanism based on dual-channel switching is proposed to realize multimodal interaction. The speech recognition model based on Streaming Multi-Layer Truncated Attention (SMLTA) provided by the Baidu cloud platform is adopted and self-trained on a deep learning platform, raising the recognition accuracy in the application scenario from 80.10% to 97.98%. Combining the depth information and skeleton information from Kinect V2, a Convolutional Neural Network (CNN) gesture recognition model based on feature fusion is constructed and trained; its average precision is 98.33%, which is 1.16% higher than that of the traditional decision tree model and 0.33% higher than that of the traditional CNN model. Finally, simulation and physical experiments are carried out in a Robot Operating System (ROS)-Gazebo training scenario. The results show that the proposed interaction framework can effectively control UAV swarm formation: the command execution success rates of the speech channel, the gesture channel, and channel switching all exceed 90%, with high interaction efficiency.

Key words: deep learning, human-computer interaction, UAV swarm, speech recognition, gesture recognition
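
The abstract does not detail how the dual-channel switching mechanism arbitrates between the speech and gesture channels. The minimal Python sketch below shows one plausible arbitration scheme under stated assumptions: each recognizer emits a command with a confidence score and a timestamp, the switch discards stale or low-confidence results, prefers the channel currently in use, and otherwise switches to the more confident channel. All names (Command, DualChannelSwitcher) and the threshold and timeout values are hypothetical, not taken from the paper.

import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Command:
    channel: str       # "speech" or "gesture" (hypothetical labels)
    label: str         # recognized formation command, e.g. "line", "wedge"
    confidence: float  # recognizer confidence in [0, 1]
    timestamp: float   # wall-clock time the result was produced

class DualChannelSwitcher:
    # Hypothetical fusion-by-switching: forward results from one channel
    # at a time, switching channels only when the other channel yields a
    # fresh, more confident recognition.
    def __init__(self, conf_threshold: float = 0.9, timeout_s: float = 2.0):
        self.conf_threshold = conf_threshold  # minimum usable confidence
        self.timeout_s = timeout_s            # seconds before a result is stale
        self.active_channel: Optional[str] = None

    def select(self, speech: Optional[Command], gesture: Optional[Command]) -> Optional[Command]:
        now = time.time()
        candidates = [c for c in (speech, gesture)
                      if c is not None
                      and c.confidence >= self.conf_threshold
                      and now - c.timestamp <= self.timeout_s]
        if not candidates:
            return None  # no usable input on either channel
        for c in candidates:
            if c.channel == self.active_channel:
                return c  # stick with the channel already in use
        best = max(candidates, key=lambda c: c.confidence)
        self.active_channel = best.channel  # channel switch
        return best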
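
Likewise, the feature-fusion CNN is described only at the level of its inputs (Kinect V2 depth and skeleton data) and its average precision. The following PyTorch sketch shows one two-branch feature-fusion design consistent with that description: a small CNN encodes the depth image, an MLP encodes the flattened skeleton joints (assuming Kinect V2's 25 tracked joints with (x, y, z) coordinates each), and the concatenated features feed a classification head. The class name, layer sizes, and class count are illustrative, not the authors' architecture.

import torch
import torch.nn as nn

class FeatureFusionGestureCNN(nn.Module):  # hypothetical name
    def __init__(self, num_classes: int = 8, num_joints: int = 25):
        super().__init__()
        # Branch 1: CNN over a single-channel Kinect depth image
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),  # 32 * 4 * 4 = 512 features
        )
        # Branch 2: MLP over the flattened skeleton, (x, y, z) per joint
        self.skeleton_branch = nn.Sequential(
            nn.Linear(num_joints * 3, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Head: classify the concatenated (fused) feature vector
        self.head = nn.Sequential(
            nn.Linear(512 + 64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, depth: torch.Tensor, skeleton: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.depth_branch(depth), self.skeleton_branch(skeleton)], dim=1)
        return self.head(fused)

# Example: a batch of two 96x96 depth images with matching skeleton vectors
# logits = FeatureFusionGestureCNN()(torch.randn(2, 1, 96, 96), torch.randn(2, 75))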
