Study on a structured patient feature representation method based on the Skip-gram word embedding algorithm

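Authors: 黄艳群, 王妮, 刘红蕾, 费晓璐, 巍岚, 陈卉
Affiliations: School of Biomedical Engineering, Capital Medical University (Beijing 100069); Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University (Beijing 100069); Xuanwu Hospital, Capital Medical University (Beijing 100053)
Keywords: electronic medical records; Skip-gram algorithm; feature representation; natural language processing; word embedding
Classification number: R318; TP31
Publication year·volume·issue (pages): 2019·38·6 (568-574)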
Abstract:

Objective To find, based on the Skip-gram word embedding algorithm from representation learning, a method that overcomes the high dimensionality of structured features in electronic medical records (EMR) and represents those features at a semantic level. Methods Data were derived from the EMR system of a tertiary hospital in Beijing, China. Three categories of structured patient features were extracted: diseases, medications, and laboratory test indicators; the laboratory indicators were discretized according to their normal reference ranges. Using the Skip-gram algorithm, the discrete patient features (diseases and medications) and the discretized continuous features (laboratory indicators) were embedded into a single low-dimensional real-valued vector space. t-SNE dimensionality reduction was used to visualize the relationships among the feature vectors in that space, and cosine distances between feature vectors were calculated to corroborate the visualization quantitatively, in order to evaluate the effectiveness of the representation and to reveal potential relationships between features. Results The low-dimensional vectors reduced the dimensionality of the patient features while also capturing potential relationships among them; clinically related features were represented by vectors that lay close together. Conclusions Representing structured patient features as low-dimensional real-valued vectors with the Skip-gram algorithm achieved good results, and offers one approach both to the high dimensionality of EMR data representation and to the analysis of potential relationships among structured features.
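The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token prefixes (`dx:`, `rx:`, `lab:`), the three-level low/normal/high scheme, the example reference range, and the choice to treat every other feature in the same patient record as Skip-gram context are all assumptions made for illustration, and the embedding training itself is omitted.

```python
import math
from itertools import permutations


def discretize_lab(name, value, low, high):
    """Map a continuous lab value to a discrete token using its normal range.

    The three-level scheme (low / normal / high) is an assumption; the
    abstract only states that normal reference ranges were used.
    """
    if value < low:
        level = "low"
    elif value > high:
        level = "high"
    else:
        level = "normal"
    return f"lab:{name}:{level}"


def skipgram_pairs(record):
    """Build (target, context) training pairs from one patient record.

    Assumption: features in a record are unordered and unique, so every
    other feature in the same record serves as context for each target.
    """
    return list(permutations(record, 2))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical patient record: one disease, one drug, one lab result.
record = [
    "dx:hypertension",
    "rx:amlodipine",
    discretize_lab("glucose", 7.8, low=3.9, high=6.1),  # illustrative range
]
pairs = skipgram_pairs(record)
```

Pairs of this form are what a Skip-gram model consumes during training; once vectors have been learned, `cosine_distance` gives the kind of quantitative closeness measure that the abstract compares against the t-SNE plot.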


Address: Editorial Office of Beijing Biomedical Engineering, Anzhen Hospital, Andingmenwai, Beijing