This article introduces a challenging new task that better reflects real-world data acquisition: multimodal industrial surface defect detection with missing modalities caused by uncertain sensor availability. The proposed resilient MISDD-MM framework is characterized by two components: cross-modal prompt learning and symmetric contrastive learning.
Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on the missing-modality problem caused by uncertain sensor availability in MISDD. In this setting, multimodal fusion faces several difficulties, including learning mode transformation and information vacancy. To address them, we first propose cross-modal prompt learning, which includes three prompts: i) the cross-modal consistency prompt establishes information consistency between the two visual modalities; ii) the modality-specific prompt is inserted to adapt to different input patterns; iii) the missing-aware prompt is attached to compensate for the information vacancy caused by dynamically missing modalities. In addition, we propose symmetric contrastive learning, which uses the text modality as a bridge for fusing the two visual modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is used to accomplish multimodal learning. Experimental results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC at a total missing rate of 0.7 for the RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58%, respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
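To make the idea of text as a bridge more concrete, the following is a minimal sketch of how a pair of antithetical text embeddings (one for "normal", one for "defective") could turn a visual feature into a binary defect probability, in the style of CLIP-like scoring. It is not the authors' implementation: the function names, the temperature value, the averaging over available modalities, and the toy feature vectors are all assumptions made for illustration, and the training loss is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def defect_probability(visual, text_normal, text_defect, temperature=0.07):
    """Two-way softmax over similarities to the paired text embeddings.

    Assumption: the temperature of 0.07 is a common CLIP-style default,
    not a value taken from the paper.
    """
    s_normal = cosine(visual, text_normal) / temperature
    s_defect = cosine(visual, text_defect) / temperature
    m = max(s_normal, s_defect)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_defect = math.exp(s_defect - m)
    return e_defect / (e_normal + e_defect)


def fused_defect_score(rgb_feat, pc_feat, text_normal, text_defect):
    """Average the defect probability over whichever modalities are present.

    Assumption: simple averaging stands in for the fusion step; a missing
    modality is passed as None.
    """
    feats = [f for f in (rgb_feat, pc_feat) if f is not None]
    if not feats:
        raise ValueError("at least one visual modality is required")
    probs = [defect_probability(f, text_normal, text_defect) for f in feats]
    return sum(probs) / len(probs)
```

With toy embeddings such as `text_normal = [1.0, 0.0]` and `text_defect = [0.0, 1.0]`, a visual feature close to `text_defect` yields a score above 0.5, and the score remains defined when either `rgb_feat` or `pc_feat` is `None`.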
The overall flowchart of our proposed framework for MISDD-MM. It consists of three serial phases: (I) Missing modalities configuration, which produces three input patterns through three modality-missing settings. (II) Cross-modal prompt learning, which includes three specially designed prompts (drawn as colored solid lines), i.e., the cross-modal consistency prompt, the modality-specific prompt, and the missing-aware prompt. (III) Symmetric contrastive learning, which performs triple-modal contrastive pre-training to generate defect detection results. Prompt injection occurs at the early transformer layers, where the three prompts are prepended to the input tokens according to the current modality availability.
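The prompt injection step in phase (II) can be sketched as follows. This is only an illustration under stated assumptions: the bank layout, the pattern names (`complete`, `rgb_only`, `3d_only`), the prompt order, and the helper `make_toy_bank` are hypothetical, and the prompts are constant toy vectors rather than learned parameters.

```python
def modality_pattern(has_rgb, has_3d):
    """Map sensor availability to one of three input patterns."""
    if has_rgb and has_3d:
        return "complete"
    if has_rgb:
        return "rgb_only"
    if has_3d:
        return "3d_only"
    raise ValueError("at least one modality must be available")


def prepend_prompts(tokens, modality, pattern, bank):
    """Prepend consistency, modality-specific, and missing-aware prompts.

    tokens:   list of token vectors for one modality branch
    modality: "rgb" or "3d" (which branch the tokens belong to)
    pattern:  output of modality_pattern()
    bank:     dict holding the three prompt groups (hypothetical layout)
    """
    if modality not in bank["specific"]:
        raise ValueError(f"unknown modality: {modality}")
    prompts = (
        bank["consistency"]              # shared across both visual modalities
        + bank["specific"][modality]     # chosen by the branch being encoded
        + bank["missing_aware"][pattern] # chosen by what is currently missing
    )
    return prompts + tokens


def make_toy_bank(dim=4, length=2):
    """Build constant-valued prompts; real prompts would be learned."""
    def const(value):
        return [[value] * dim for _ in range(length)]

    return {
        "consistency": const(0.1),
        "specific": {"rgb": const(0.2), "3d": const(0.3)},
        "missing_aware": {
            "complete": const(0.0),
            "rgb_only": const(0.5),
            "3d_only": const(-0.5),
        },
    }
```

For example, with five RGB tokens and a prompt length of two, `prepend_prompts(tokens, "rgb", modality_pattern(True, False), make_toy_bank())` returns a sequence of eleven vectors: six prompt vectors followed by the original five tokens.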
I-AUROC (%), P-AUROC (%), and AUPRO (%) scores on the MVTec 3D-AD dataset under different missing modalities and missing rates (η). A missing modality of "both" means that RGB images and 3D data are each missing at a rate of η/2.
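The missing-modality settings in this table can be simulated with a short routine like the one below. It is a sketch under assumptions, not the authors' data pipeline: the function name `configure_missing`, the seeded shuffle, and the rounding of the missing count are illustrative choices. It follows the caption in splitting the rate evenly for "both" and never removes both modalities from the same sample.

```python
import random


def configure_missing(n_samples, eta, missing="both", seed=0):
    """Return a list of (has_rgb, has_3d) flags, one tuple per sample.

    eta:     total missing rate in [0, 1]
    missing: "rgb", "3d", or "both" (eta/2 for each modality)
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    n_missing = round(n_samples * eta)
    if missing == "rgb":
        n_rgb, n_3d = n_missing, 0
    elif missing == "3d":
        n_rgb, n_3d = 0, n_missing
    elif missing == "both":
        n_rgb = n_missing // 2
        n_3d = n_missing - n_rgb
    else:
        raise ValueError(f"unknown missing setting: {missing}")

    # Shuffle sample indices reproducibly, then carve out disjoint groups.
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)

    availability = [(True, True)] * n_samples
    for i in order[:n_rgb]:
        availability[i] = (False, True)   # RGB image missing
    for i in order[n_rgb:n_rgb + n_3d]:
        availability[i] = (True, False)   # 3D data missing
    return availability
```

Because the two missing groups are drawn from disjoint slices of the shuffled indices, every sample keeps at least one modality, which matches the three input patterns (complete, RGB missing, 3D missing) described for the framework.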
Category-level P-AUROC
Few-shot detection performance
@article{jiang2025resilient,
  title={Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability},
  author={Jiang, Shuai and Ma, Yunfeng and Zhou, Jingyu and Wang, Yaonan and Liu, Min},
  journal={IEEE/ASME Transactions on Mechatronics},
  year={2025},
  publisher={IEEE}
}