A mislabeled sample recognition method for small-sample data

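Authors: Qin Ruibin, Zheng Haoran, Zhou Hong
Affiliation: School of Computer Science and Technology, University of Science and Technology of China (Hefei 230027)
Keywords: mislabeling; small-sample data; microarray
Classification number:
Year·Volume·Issue (pages): 2012·31·6 (574-578)
Abstract: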

Objective To address label errors in small-sample data, this paper proposes UCL-stability, a weighted mislabeled-sample recognition algorithm built on the CL-stability algorithm. Methods UCL-stability defines a voting weight from the number of differential features that can be selected from the data after a sample's label is flipped; the weight measures how strongly flipping each sample's label affects classification. Results Experiments on two cancer gene expression (microarray) data sets show that both UCL-stability and CL-stability effectively identify suspect samples. Further experiments with artificially mislabeled samples show that UCL-stability achieves higher precision and recall than CL-stability, which uses no voting weight. Conclusions UCL-stability accounts not only for the effect of a single sample's label error on classifier design in small-sample data, but also for the differences in how label errors in different samples affect the classification result. By using feature information to measure these differences, UCL-stability achieves better results.
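
The precision and recall reported for the artificial-mislabeling experiments can be illustrated with a minimal sketch. It assumes the detector returns a set of flagged sample indices and that the deliberately flipped samples are known. The function name `precision_recall` and the sample indices are hypothetical, not taken from the paper.

```python
def precision_recall(flagged, truly_mislabeled):
    """Score flagged sample indices against the samples whose labels were flipped.

    precision = correctly flagged / all flagged
    recall    = correctly flagged / all truly mislabeled
    """
    flagged = set(flagged)
    truly_mislabeled = set(truly_mislabeled)
    true_positives = len(flagged & truly_mislabeled)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_mislabeled) if truly_mislabeled else 0.0
    return precision, recall


# Hypothetical run: labels of samples 3, 7 and 12 were flipped on purpose,
# and a detector flagged samples 3, 7, 20 and 31 as suspect.
precision, recall = precision_recall([3, 7, 20, 31], [3, 7, 12])
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this made-up run two of the four flagged samples are truly mislabeled, and two of the three flipped samples are recovered, so a method that flags fewer correctly labeled samples raises precision while one that misses fewer flipped samples raises recall.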

Address: Editorial Office of Beijing Biomedical Engineering, Anzhen Hospital, Andingmenwai, Beijing