Real-time perception and continuous tracking of unknown objects are an important prerequisite for autonomous decision-making in intelligent systems. In practice, the absence of prior category information and the scarcity of training samples make perceiving and tracking unknown objects especially challenging. To address this problem, we propose a category-agnostic object tracking method that combines the Segment Anything Model (SAM) with sparse feature point matching. The method first guides SAM with prompt points to perceive and segment the unknown object in an image, then extracts sparse keypoints of the object with a CNN-based keypoint extraction model as a lightweight object representation, and matches these keypoints in subsequent frames with an attention-based matching network to propagate the object information. On this basis, an Iterative SAM with Point Consensus (ISPC) module is designed, in which the matched keypoints continually prompt SAM to segment the object in subsequent frames, yielding stable tracking of unknown objects. Because the sparse-keypoint representation is lightweight, it can be shared efficiently among multiple agents, enabling a collaborative object tracking system. Tracking performance and zero-shot generalization to unseen objects are evaluated on the DAVIS 2017 dataset and a self-constructed near-infrared video dataset. Experimental results show that the proposed method achieves good robustness and accuracy in collaborative perception and tracking of unknown-category objects.
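To make the pipeline concrete, the following is a minimal sketch of one ISPC tracking step. It assumes the publicly released segment_anything package (SamPredictor); the keypoint extractor and attention-based matcher are abstracted as injected callables extract_fn and match_fn, which are hypothetical stand-ins and not the authors' implementation. The sketch only illustrates the loop structure: match keypoints into the new frame, prompt SAM with the matched points, then keep the keypoints that fall inside the new mask as the target representation for the next frame.

```python
# Illustrative sketch of one ISPC tracking step (not the paper's code).
# Assumes the public `segment_anything` package; `extract_fn` / `match_fn` are
# hypothetical stand-ins for the CNN keypoint extractor and attention-based matcher.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def ispc_step(predictor, prev_kpts, prev_desc, frame, extract_fn, match_fn):
    """Propagate the target keypoints to `frame` and re-segment the target.

    prev_kpts : (N, 2) float array, keypoints inside the previous target mask (x, y)
    prev_desc : (N, D) float array, descriptors of those keypoints
    frame     : (H, W, 3) uint8 RGB image of the current frame
    Returns the new mask and the keypoints/descriptors retained for the next step.
    """
    # 1. Extract sparse keypoints and descriptors from the new frame.
    cur_kpts, cur_desc = extract_fn(frame)                          # (M, 2), (M, D)

    # 2. Match the previous target keypoints against the new frame.
    matches = match_fn(prev_kpts, prev_desc, cur_kpts, cur_desc)    # list of (i, j) index pairs
    matched_pts = np.array([cur_kpts[j] for _, j in matches], dtype=np.float32)
    if len(matched_pts) == 0:
        return None, prev_kpts, prev_desc                           # target lost in this frame

    # 3. Use the matched points as positive point prompts for SAM on the new frame.
    predictor.set_image(frame)
    masks, scores, _ = predictor.predict(
        point_coords=matched_pts,
        point_labels=np.ones(len(matched_pts), dtype=np.int32),
        multimask_output=False,
    )
    mask = masks[0]

    # 4. Point consensus: keep only keypoints that fall inside the new mask and
    #    carry them forward as the target representation for the next frame.
    inside = mask[cur_kpts[:, 1].astype(int), cur_kpts[:, 0].astype(int)]
    return mask, cur_kpts[inside], cur_desc[inside]

# Usage sketch (checkpoint path is a placeholder):
# sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
# predictor = SamPredictor(sam)
```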