MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

1School of Information Science and Electronic Engineering, Shanghai Jiao Tong University   2MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University  

Dataset Download

If you are interested in using the dataset, you can download it from Hugging Face or Model Scope.

Dataset Organization

📂 MMR-AD Dataset Directory Structure

  • 📁 MMR-AD
    • 📁 mvtec
      • 📁 bottle
        • 📁 ground_truth
        • 📁 test
        • 📁 test_good
        • 📁 train_good
        • 📄 bbox_annos.json
        • 📄 text_annos.json
        • 📄 reference.json
      • 📁 cable
        • 📁 ground_truth
        • 📁 test
        • ...
        • 📄 reference.json
      • 📁 carpet
      • 📁 grid
      • 📁 hazelnut
      • ...
  • ...

The MMR-AD dataset consists of multiple sub-datasets, each of which contains multiple classes. Taking bottle as an example: bbox_annos.json contains the original bounding-box annotations, text_annos.json contains the annotated text data, and reference.json stores the reference image set for each image. ground_truth holds the mask annotations. test is divided into subdirectories by anomaly type, each containing the anomalous samples of that type, while test_good contains the normal test samples and train_good contains the normal training samples.
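The per-class layout above can be traversed programmatically. Below is a minimal sketch of loading the three annotation files and discovering the anomaly types for one class; the function names are ours, and the JSON schemas inside the files are not assumed beyond "parseable JSON".

```python
import json
from pathlib import Path


def load_class_annotations(class_dir):
    """Load the three per-class annotation files of one MMR-AD class.

    Returns (bboxes, texts, references) parsed from bbox_annos.json,
    text_annos.json, and reference.json. The internal schema of each
    file is not assumed here -- only that each is valid JSON.
    """
    class_dir = Path(class_dir)
    bboxes = json.loads((class_dir / "bbox_annos.json").read_text())
    texts = json.loads((class_dir / "text_annos.json").read_text())
    refs = json.loads((class_dir / "reference.json").read_text())
    return bboxes, texts, refs


def list_anomaly_types(class_dir):
    """Anomaly types are the subdirectory names under test/."""
    test_dir = Path(class_dir) / "test"
    return sorted(p.name for p in test_dir.iterdir() if p.is_dir())
```

For example, `list_anomaly_types("MMR-AD/mvtec/bottle")` would return the names of the anomaly-type subdirectories under `bottle/test`.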

Collection of MMR-AD

The original images in the MMR-AD dataset are from publicly available AD datasets. We generated text data for these images to construct a large-scale multimodal AD dataset. To ensure data scale and diversity, we extensively collected and sampled from 14 publicly available AD datasets (see Tab.1). Specifically, the datasets we employed include MVTecAD, VisA, MVTecLOCO, MVTec3D, MPDD, GoodsAD, RealIAD, RealIAD-D3, MANTA, MIAD, CableInspect, WFDD, TextureAD, and 3CAD. While processing these datasets, we found that many of their samples are of low quality. Thus, to ensure the quality of our MMR-AD dataset, we manually checked all the data (~190K original images) and removed low-quality samples. In addition, to assist subsequent text generation and to evaluate the model's ability to accurately locate anomalies, we further manually annotated bounding boxes and text labels for the anomalous regions.


Text Generation Pipeline

We propose an automatic pipeline that leverages current strong MLLMs to efficiently generate text data for each sample. Our pipeline uses the visual reasoning capabilities of Qwen2.5-VL-72B to first generate elaborate reasoning data and then output the answer. We provide a spatially aligned nearest sample as the normal reference for each input sample and instruct Qwen2.5-VL-72B to generate AD-related text data by comparing the input image with the reference image. Since Qwen2.5-VL-72B is primarily trained on natural images and may struggle with industrial anomaly detection, we further provide additional visual and textual hints. For the visual hints, we draw a red bounding box around each anomalous region on the input image to make the model aware of the locations of anomalies. The textual hints are composed of the bounding-box coordinates of the anomalous regions and the corresponding anomaly types, such as "the location and label of the abnormal area is ([xmin, ymin, xmax, ymax], 'broken')".

[Figure: overview of the text generation pipeline]

Data Examples in MMR-AD

[Figure: data examples in MMR-AD]

Comparison with Other Multi-modal AD Datasets

[Figure: comparison with other multi-modal AD datasets]

BibTeX


      @inproceedings{yao2026mmr,
        title={MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models},
        author={Yao, Xincheng and Qian, Zefeng and Shi, Chao and Song, Jiayang and Zhang, Chongyang},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
        pages={00000--00000},
        year={2026}
      }