Prognostic factors of breast cancer with machine learning method based on SEER database

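Authors: Zhang Minghuan, Zhang Xuan, Guo Xin, Chen Ying
Affiliation: Research Center for Big Data Analysis and Processing, Shanghai Sanda University (Shanghai 201209)
Keywords: SEER database; breast cancer; logistic regression; decision tree; prognostic factors
Classification number: R318.04; Q334
Year, volume, issue (pages): 2019, 38(5): 486-491
Abstract: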

Objective: Using breast cancer data recorded in the SEER database from 1990 to 2014, this study applies machine learning methods to analyze the prognostic factors of breast cancer, with the aim of helping physicians assess patient prognosis effectively.

Methods: On the advice of clinicians, twelve fields were selected as model inputs, and 5-year postoperative survival status was used as the model output. Prognostic factors were first screened with univariate statistical analysis. Two machine learning classification algorithms, logistic regression and decision tree, were then used to build models and identify the factors that affect the 5-year prognosis of breast cancer. The sample data were organized by ten-fold cross-validation and balanced with oversampling and undersampling techniques. Sensitivity, specificity, and the area under the ROC curve (AUC) served as the model evaluation metrics.

Results: Among the twelve input fields, tumor stage, tumor grade, tumor size, estrogen level, age group, and progesterone level had a considerable influence on breast tumor prognosis. For both models, sensitivity and specificity on the test set fell between 74.2% and 78.2%, and the AUC fell between 0.838 and 0.850.

Conclusions: Optimized prognostic models for breast cancer patients built with logistic regression and decision tree algorithms can help physicians judge patient prognosis and treatment effect.
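The abstract names the sample organization and evaluation steps but gives no implementation details. The following is a minimal sketch, in plain Python, of how stratified ten-fold splitting, random undersampling, and the three evaluation metrics could be computed. All function names are hypothetical, the toy labels and scores are made up for illustration and are not SEER data, and only undersampling is shown (the study also used oversampling). It is not the authors' code.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class ratios in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def undersample(indices, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced label vector (assumption): 20 samples of class 1, 80 of class 0.
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, k=10)

# Hold out fold 0 for testing; balance only the training portion.
test_idx = set(folds[0])
train_idx = [i for i in range(len(labels)) if i not in test_idx]
balanced_train = undersample(train_idx, labels)

# Toy predicted scores (assumption), thresholded at 0.5 to get class predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(sens, spec, area)
```

Balancing is applied to the training portion only in this sketch, so the held-out fold keeps the original class ratio. Whether the study balanced before or after splitting is not stated in the abstract.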


Address: Editorial Office of Beijing Biomedical Engineering, Anzhen Hospital, Andingmenwai, Beijing