Background: Obtaining large amounts of real patient data requires considerable effort and expense, and processing such data raises data protection concerns. Consequently, data sharing is not always possible, particularly when large, open science datasets are needed, as in AI development. For such purposes, generating realistic synthetic data may be a solution. Our project aimed to generate realistic cancer data, using laryngeal cancer as the use case. Methods: We used the open-source software Synthea and programmed an additional module covering the development, treatment and follow-up of laryngeal cancer, based on external real-world (RW) evidence from guidelines and cancer registries in Germany. To generate an incidence-based cohort view, we randomly drew laryngeal cancer cases from the simulated population and deceased persons, stratified by the RW age and sex distributions at diagnosis. Results: A module with age- and stage-specific treatment and prognosis for laryngeal cancer was successfully implemented. The synthesized population reflects RW prevalence well, and we extracted a cohort of 50,000 laryngeal cancer patients from it. Descriptive data on stage-specific and 5-year overall survival were in accordance with published data. Conclusions: We developed a large cohort of realistic synthetic laryngeal cancer cases with Synthea. Such data can be shared and published open source without data protection issues.
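The stratified random draw described in the Methods can be illustrated with a minimal sketch. It is not the authors' procedure: the record schema (`age_group`, `sex`), the function name `stratified_draw` and the target shares passed to it are hypothetical, and real shares would have to come from registry data. The sketch assumes simulated cases are already available as a list of records and draws, without replacement, a fixed quota per age and sex stratum.

```python
import random
from collections import defaultdict


def stratified_draw(cases, target_shares, n_total, seed=0):
    """Draw roughly n_total cases so that strata follow target_shares.

    cases: list of dicts with 'age_group' and 'sex' keys (hypothetical schema).
    target_shares: {(age_group, sex): share}; shares are expected to sum to 1.
    Quotas are rounded per stratum, so the cohort size can differ slightly
    from n_total.
    """
    rng = random.Random(seed)

    # Group the simulated cases by their (age_group, sex) stratum.
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[(case["age_group"], case["sex"])].append(case)

    cohort = []
    for stratum, share in target_shares.items():
        quota = round(share * n_total)
        pool = by_stratum.get(stratum, [])
        if quota > len(pool):
            # The simulated population is too small for this stratum.
            raise ValueError(
                f"stratum {stratum}: need {quota} cases, only {len(pool)} simulated"
            )
        # Sample without replacement within the stratum.
        cohort.extend(rng.sample(pool, quota))
    return cohort
```

Raising an error when a stratum is too small, rather than sampling with replacement, keeps every drawn case unique; in practice that means the simulated population must be large enough to fill the rarest stratum.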
Generation of a Realistic Synthetic Laryngeal Cancer Cohort for AI Applications