Breast cancer is the most common type of cancer worldwide. Alarmingly, approximately 30% of breast cancer cases result in disease recurrence at distant organs after treatment. Distant recurrence is more common in some subtypes, such as invasive breast carcinoma (IBC). While clinicians have used several clinicopathological measurements to predict distant recurrences in IBC, no studies have predicted distant recurrences by combining clinicopathological evaluations of IBC tumors pre- and post-therapy with machine learning (ML) models. The goal of our study was to determine whether classification-based ML techniques could predict distant recurrences in IBC patients using key clinicopathological measurements: pathological staging of the tumor and surrounding lymph nodes assessed both pre- and post-neoadjuvant therapy, response to therapy assessed by standard-of-care (SOC) imaging, and binary status of adjuvant therapy administered to patients. We trained and tested four clinicopathological ML models using a dataset from Duke University (144 patients for training and 17 for testing) and validated the best-performing model using an external dataset (8 patients) from Dartmouth Hitchcock Medical Center. The random forest model performed better than the C-support vector classifier, multilayer perceptron, and logistic regression models, yielding area under the curve (AUC) values of 1.0 in the testing set and 0.75 in the external validation set (p < 0.002), thereby demonstrating the cross-institutional portability and validity of ML models in the field of clinical research in cancer. The top-ranking clinicopathological measurements impacting the prediction of distant recurrences in IBC were identified to be tumor response to neoadjuvant therapy, as evaluated by SOC imaging and by pathology, the latter comprising both tumor and node staging.
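For readers less familiar with the AUC metric reported above, the following is a minimal sketch of how it can be computed from predicted recurrence scores. It uses the pairwise (Mann-Whitney) definition rather than any particular library, and it is not the study's evaluation code. The function name `roc_auc`, the labels, and the scores are all hypothetical illustrations, not patient data.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen recurrent case (label 1)
    is scored higher than a randomly chosen non-recurrent case (label 0).
    Tied scores count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case from each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical 8-patient set: 1 = distant recurrence, 0 = no recurrence.
# The class split and the scores are assumptions made for illustration only.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.2, 0.1, 0.1]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

With only a handful of patients, the number of positive-negative pairs is small, so the AUC can take only a few distinct values. In this hypothetical split of 2 recurrent and 6 non-recurrent cases there are 12 pairs, and each pair shifts the AUC by 1/12.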