Background: Melanoma is one of the most lethal skin cancers, and survival depends largely on early detection, yet diagnosis remains difficult because melanoma closely resembles benign nevi. Convolutional neural networks (CNNs) have achieved strong performance in dermoscopic analysis but often depend on fixed input sizes and local features, which can limit generalization. Vision Transformers (ViTs), which capture global image relationships through self-attention, offer a promising alternative.

Methods: A ViT-L/16 model was fine-tuned on the ISIC 2019 dataset, which contains more than 25,000 dermoscopic images. To expand the dataset and balance class representation, synthetic melanoma and nevus images were generated with StyleGAN2-ADA, retaining only high-confidence outputs. Model performance was evaluated on an external biopsy-confirmed dataset (MN187) and compared with CNN baselines (ResNet-152, DenseNet-201, EfficientNet-B7, ConvNeXt-XL), a smaller ViT-B/16 model, and the commercial MoleAnalyzer Pro system, using ROC-AUC and DeLong's test.

Results: The ViT-L/16 model reached a baseline ROC-AUC of 0.902 on MN187, surpassing all baseline models and the MoleAnalyzer Pro system, although the difference was not statistically significant (p = 0.07). After 46,000 confidence-filtered GAN-generated images were added to the training set, the ROC-AUC increased to 0.915, a statistically significant improvement over the commercial MoleAnalyzer Pro system (p = 0.032).

Conclusions: Vision Transformers show strong potential for melanoma classification, especially when combined with GAN-based augmentation; their advantages in global feature representation and data expansion support the development of dependable AI-driven clinical decision-support systems.
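The Results paragraphs compare ROC-AUCs computed on the same MN187 test set, which is why DeLong's test for *correlated* ROC curves is used rather than an independent-samples comparison. A minimal sketch of that comparison using DeLong's structural-component (placement-value) variance estimate is shown below; the function names and the toy labels/scores are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.stats import norm

def _placements(pos, neg):
    # Mid-rank comparison: 1 if positive score > negative score,
    # 0.5 on ties, 0 otherwise.
    psi = (pos[:, None] > neg[None, :]).astype(float)
    psi += 0.5 * (pos[:, None] == neg[None, :])
    # V10: per-positive placement values; V01: per-negative placement values.
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y, scores_a, scores_b):
    """Two-sided DeLong test for the difference of two correlated ROC-AUCs
    estimated from the same test set (labels y in {0, 1})."""
    y = np.asarray(y)
    pos_mask, neg_mask = y == 1, y == 0
    aucs, v10s, v01s = [], [], []
    for s in (np.asarray(scores_a), np.asarray(scores_b)):
        v10, v01 = _placements(s[pos_mask], s[neg_mask])
        aucs.append(v10.mean())   # AUC equals the mean placement value
        v10s.append(v10)
        v01s.append(v01)
    m, n = pos_mask.sum(), neg_mask.sum()
    s10 = np.cov(np.vstack(v10s))  # 2x2 covariance across positives
    s01 = np.cov(np.vstack(v01s))  # 2x2 covariance across negatives
    # Variance of the AUC difference; zero if the two score vectors
    # induce identical rankings (the test is then undefined).
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    p = 2 * norm.sf(abs(z))
    return aucs[0], aucs[1], p

# Toy example (illustrative data, not MN187):
auc_a, auc_b, p = delong_test([0, 0, 1, 1],
                              [0.1, 0.4, 0.35, 0.8],
                              [0.2, 0.3, 0.6, 0.9])
```

The placement-value formulation avoids bootstrapping: each model's AUC and the covariance between the two models' AUC estimates come directly from the per-case comparison matrix, which is what makes the paired comparison on MN187 valid.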