Beijing Biomedical Engineering

Patients privacy preserving algorithm of Chinese electronic medical record based on rule and machine learning

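Authors: Wang Yangyang, Zheng Xichuan

Affiliations: Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University (Shanghai 200033); School of Biomedical Engineering, Shanghai Jiao Tong University (Shanghai 200230)

Keywords: privacy protection; electronic medical record; named entity; regular expression; hidden Markov model

Classification number: R318.04

Year · Volume · Issue (Pages): 2019 · 38 · 5 (492-497)

Abstract: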

Objective: To address the risk of patient privacy leakage in medical data publishing and sharing, and the low efficiency of manual de-identification, this paper proposes an algorithm that combines rules with machine learning to remove patient privacy information from electronic medical records. Methods: Following the Health Insurance Portability and Accountability Act (HIPAA) and the expression habits of Chinese electronic medical records, private data are divided into three categories: numbers, dates, and named entities. Regular expressions identify private data in the number and date categories, and a hidden Markov model identifies named entities. Discharge summaries from Shanghai Sixth People's Hospital serve as test data, and the recall and precision of private-data identification are evaluated with the hold-out method. Results: The model achieves an overall recall above 90%. Recall for both the number and date categories exceeds 96%, and recognition of Chinese personal names also outperforms identification by a single person. Conclusions: The model combining rules and machine learning effectively identifies patients' private data and facilitates the sharing of medical data.
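The rule-based step and the recall/precision evaluation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual rules, so everything here is an assumption made for illustration: the three patterns (an 18-character ID number, an 11-digit mobile number, and two common date layouts), the labels `ID_NUMBER`, `PHONE`, and `DATE`, and the helper names `find_private_spans`, `mask`, and `recall_precision`. The hidden Markov model used for named entities is not shown.

```python
import re

# Hypothetical patterns for illustration only; the paper's real rules are not
# given in the abstract. Digit lookarounds stop a pattern from matching inside
# a longer digit run.
PATTERNS = [
    ("ID_NUMBER", r"(?<!\d)\d{17}[\dXx](?!\d)"),   # 18-character ID number
    ("PHONE", r"(?<!\d)1\d{10}(?!\d)"),            # 11-digit mobile number
    ("DATE", r"(?<!\d)\d{4}(?:年\d{1,2}月\d{1,2}日|[-/.]\d{1,2}[-/.]\d{1,2})(?!\d)"),
]

# One alternation of named groups; at any position the first alternative
# that matches wins, so longer identifiers are listed first.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))


def find_private_spans(text):
    """Return (start, end, label) for every rule match in the text."""
    return [(m.start(), m.end(), m.lastgroup) for m in COMBINED.finditer(text)]


def mask(text):
    """Replace every rule match with a bracketed category label."""
    return COMBINED.sub(lambda m: f"[{m.lastgroup}]", text)


def recall_precision(predicted, gold):
    """Exact-match recall and precision over sets of (start, end, label)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision
```

Under these assumed patterns, `mask("患者于2018年3月5日入院，联系电话13812345678，2018-03-20出院。")` returns `"患者于[DATE]入院，联系电话[PHONE]，[DATE]出院。"`. In a hold-out evaluation, `recall_precision` would be applied to the spans predicted on the held-out records against manually annotated spans.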
