This systematic review addresses a research gap concerning the performance of computational algorithms for digital image analysis of HER2 images in clinical settings. Although numerous studies have explored aspects of these algorithms, their effectiveness in real-world clinical applications has not been comprehensively evaluated. We searched the Web of Science and PubMed databases for studies published from 31 December 2013 to 30 June 2024, focusing on performance effectiveness and on study components such as dataset size, diversity, and source; ground truth; annotation; and validation methods. The review was registered with PROSPERO (CRD42024525404). Three key questions guided the review: How effective are current computational algorithms at detecting HER2 status in digital images? What validation methods and dataset characteristics are common in these studies? Is there a standardized approach to evaluating algorithms for clinical application that could improve the clinical utility and reliability of computational tools for HER2 detection in digital image analysis? We identified 6833 publications, of which 25 met the inclusion criteria. Accuracy on clinical datasets ranged from 84.19% to 97.9%. Among synthesized datasets, the highest accuracy, 98.8%, was achieved on the publicly available Warwick dataset. Only 12% of studies used separate datasets for external validation; 64% reported a combination of accuracy, precision, recall, and F1 as their set of performance measures. Despite the high accuracy rates reported, there is a notable absence of direct evidence supporting clinical application. Integrating these technologies into clinical practice requires urgently addressing real-world challenges and the overreliance on internal validation. Standardizing study designs on real clinical datasets can enhance the reliability and clinical applicability of computational algorithms in improving the detection of HER2-positive cancer.
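For readers less familiar with the four performance measures most often reported together in the included studies (accuracy, precision, recall, and F1), the following minimal sketch shows how they relate. The function name `classification_metrics`, the choice of "3+" as the positive class, and the example labels are illustrative assumptions; they are not drawn from any reviewed study or dataset.

```python
from collections import Counter


def classification_metrics(y_true, y_pred, positive):
    """Return accuracy, precision, recall and F1, treating `positive` as the class of interest.

    Hypothetical helper for illustration only; the labels used with it are made up.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Count every (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))

    tp = sum(n for (t, p), n in pairs.items() if t == positive and p == positive)
    fp = sum(n for (t, p), n in pairs.items() if t != positive and p == positive)
    fn = sum(n for (t, p), n in pairs.items() if t == positive and p != positive)
    correct = sum(n for (t, p), n in pairs.items() if t == p)

    # Accuracy is computed over all classes; the other three are one-vs-rest.
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up HER2 scores for eight hypothetical image tiles.
    truth = ["3+", "3+", "2+", "0", "1+", "3+", "0", "2+"]
    predicted = ["3+", "2+", "2+", "0", "1+", "3+", "1+", "2+"]
    print(classification_metrics(truth, predicted, positive="3+"))
```

Because accuracy alone can look high when one class dominates, reporting precision, recall, and F1 alongside it gives a fuller picture. None of these measures substitutes for validation on a separate external dataset.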