Background: Electronic health records (EHRs) contain data that are valuable for disease prediction. Artificial intelligence (AI), particularly deep learning, improves disease prediction by analyzing large EHR datasets to identify hidden patterns, which facilitates early detection. Recently, numerous foundation models pretrained on extensive data have demonstrated efficacy in disease prediction using EHRs. However, questions remain about how best to use such models, especially with very small fine-tuning cohorts. Methods: We used Med-BERT, an EHR-specific foundation model, and reformulated the binary disease prediction task as a token prediction task and as a next-visit mask token prediction task, matching Med-BERT’s pretraining task format, in order to improve the accuracy of pancreatic cancer (PaCa) prediction in both few-shot and fully supervised settings. Results: Reformulating the task as a token prediction task (Med-BERT-Sum) gave slightly better performance in both few-shot scenarios and larger data samples. Reformulating the prediction task as a next-visit mask token prediction task (Med-BERT-Mask) significantly outperformed the conventional binary classification (BC) task (Med-BERT-BC), by 3% to 7% in few-shot scenarios with 10 to 500 samples. These findings show that aligning the downstream task with Med-BERT’s pretraining objectives substantially enhances the model’s predictive capability, improving its effectiveness in predicting both rare and common diseases. Conclusions: Reformatting disease prediction tasks to align with the pretraining of foundation models enhances prediction accuracy, leading to earlier detection and timely intervention. This approach improves treatment effectiveness, survival rates, and overall patient outcomes for PaCa and potentially other cancers.
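The abstract does not spell out how the next-visit mask token reformulation works, so the following is only a minimal sketch of one plausible reading: a visit holding a single mask token is appended to the patient's history, and the disease score is read from the vocabulary distribution predicted at that position rather than from a separate binary classification head. The toy vocabulary, the token names, and the functions `append_masked_visit` and `disease_probability` are all hypothetical and are not taken from Med-BERT or the paper.

```python
import math

# Hypothetical toy vocabulary of medical codes (not Med-BERT's real vocabulary).
VOCAB = {"[PAD]": 0, "[MASK]": 1, "DX_DIABETES": 2, "DX_PANCREATITIS": 3, "DX_PACA": 4}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def append_masked_visit(visits):
    """Return a new visit list with one extra visit holding a single [MASK] token.

    Assumption: each visit is a list of token ids and the input is not mutated.
    """
    return visits + [[VOCAB["[MASK]"]]]


def disease_probability(mask_logits, target_token="DX_PACA"):
    """Read the disease score from the token distribution at the [MASK] position.

    mask_logits stands in for the output of a pretrained token prediction head
    at the masked position (one logit per vocabulary entry).
    """
    probs = softmax(mask_logits)
    return probs[VOCAB[target_token]]


# Usage with made-up numbers: two past visits, then a masked next visit.
patient_visits = [[VOCAB["DX_DIABETES"]], [VOCAB["DX_PANCREATITIS"]]]
model_input = append_masked_visit(patient_visits)
example_logits = [0.0, 0.0, 1.0, 1.0, 3.0]  # placeholder for real model output
paca_score = disease_probability(example_logits)
```

In this framing the fine-tuned model reuses the output layer it was pretrained with, which is the alignment between downstream task and pretraining objective that the abstract credits for the few-shot gains.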


Original article:

Advancing Pancreatic Cancer Prediction with a Next Visit Token Prediction Head on Top of Med-BERT



Published: 4 February 2025

DOI: 10.3390/cancers17030516

Type: Article

Open access: Yes
