Beijing Biomedical Engineering

Recognition technology of the laboratory sheet based on Tesseract

Authors: 张淙悦, 尹梓名, 孙大运, 戴维

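Affiliation: School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology (Shanghai 200093)

Keywords: laboratory sheet; optical character recognition; image processing; error correction

CLC number: R318.04; TP391.5

Year·Volume·Issue (pages): 2019·38·3 (283-289)

Abstract: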

Objective: Because laboratory sheets faithfully record a patient's health status, converting paper laboratory sheets into electronic medical records for storage is valuable for insurance claims, hospital transfers, remote consultation, and building health records. However, clinical practice currently lacks a tool that can recognize the content of a laboratory sheet and convert it directly into an electronic medical record. This paper therefore designs a complete, automated optical character recognition (OCR) method for the content of medical laboratory sheets.

Methods: The laboratory sheet image was first preprocessed: it was binarized with Otsu's method, then deskewed with the Hough transform, and features were extracted. The content was then recognized using Tesseract's beam search algorithm and the k-nearest neighbor algorithm. The character library was trained, and the recognized content was corrected using a medical dictionary file and a unicharambigs (ambiguous character) file. On this basis, an OCR engine for medical laboratory sheets was built. Finally, the accuracy of the engine was evaluated on 302 laboratory sheet records collected from a community hospital in Shanghai.

Results: The recognition accuracy of the method was 92.72%, which basically meets clinical needs.

Conclusion: The Tesseract-based OCR engine removes the need to enter laboratory sheet data by hand. A doctor only needs to photograph a laboratory sheet and upload the photo, and the engine converts the content into a structured electronic medical record. This greatly improves doctors' working efficiency and facilitates further use of the data.
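Two of the steps named in the abstract, Otsu binarization and dictionary-based correction of recognized text, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sample vocabulary, and the 0.8 similarity cutoff are assumptions made for illustration, and the correction step uses Python's standard `difflib` fuzzy matching in place of Tesseract's own dictionary and unicharambigs files.

```python
from difflib import get_close_matches


def otsu_threshold(pixels):
    """Return the grey level t in 0..255 that maximises between-class variance.

    `pixels` is a flat iterable of integer grey levels. Values <= t form the
    dark class (ink) and values > t form the light class (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2.
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (light) using the given threshold."""
    return [255 if p > threshold else 0 for p in pixels]


def correct_token(token, vocabulary, cutoff=0.8):
    """Replace an OCR token with its closest vocabulary entry, if close enough.

    `vocabulary` is a hypothetical list of lower-case medical terms; tokens
    with no sufficiently similar entry are returned unchanged.
    """
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


if __name__ == "__main__":
    # Synthetic "image": half dark ink pixels, half bright paper pixels.
    image = [10] * 50 + [200] * 50
    t = otsu_threshold(image)
    print("threshold:", t)
    print("dark pixels:", binarize(image, t).count(0))

    # Hypothetical vocabulary and a token with a typical 1/l confusion.
    vocab = ["hemoglobin", "glucose", "creatinine"]
    print(correct_token("hemog1obin", vocab))
```

In a real pipeline the thresholding would normally come from an image library rather than a hand-written loop, and the cutoff would need tuning so that valid but rare terms are not overwritten by a more common neighbor.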

