Beijing Biomedical Engineering

Parallel information processing and analysis of gene variants in an omics big data environment

Authors: Huang Zhizhun, Wang Hongqiang

Affiliation: Hefei Institute of Intelligent Machines, Chinese Academy of Sciences (Hefei 230031)

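Keywords: second-generation sequencing technology; Hadoop; sequence data analysis; gene mutation information; single nucleotide polymorphism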

Classification number: R318.04

Year·Volume·Issue (pages): 2017·36·4 (366-371)

Abstract:

With the development and application of second-generation sequencing technology, the volume of sequencing data it produces is growing rapidly, and analyzing massive sequencing data efficiently, quickly, and stably has become an urgent need in biological research. Many traditional sequencing data analysis tools support only a single function, do not offer a complete analysis workflow, and lack the processing capacity that massive sequencing data requires. To address these problems, this paper designs a sequencing data analysis software package based on the Hadoop framework. It integrates several sequence analysis tools commonly used in biological research and thereby automates the analysis of sequencing data. Given raw sequencing data as input, the software performs base quality control, sequence alignment, SNP site extraction, and mutant gene information generation, and finally outputs a detailed report of mutant gene information. The software automates data analysis, improves its efficiency, and greatly reduces the workload of data analysts.
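As an illustration only, the base quality control stage named in the abstract can be sketched as a map-style filter over FASTQ reads. This is not the authors' implementation: the function names (`parse_fastq`, `mean_phred`, `qc_map`), the Phred+33 quality encoding, and the mean-quality threshold of 20 are all assumptions chosen for the example. In a Hadoop setting, each input split would be filtered independently in this way before alignment.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record type: (read id, bases, quality string).
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per 4-line FASTQ entry; blank lines are skipped."""
    buf: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        buf.append(line)
        if len(buf) == 4:
            header, bases, _plus, quals = buf
            yield header[1:], bases, quals  # drop the leading '@'
            buf = []


def mean_phred(quals: str, offset: int = 33) -> float:
    """Mean base quality, assuming Phred+33 encoding (an assumption)."""
    if not quals:
        return 0.0
    return sum(ord(c) - offset for c in quals) / len(quals)


def qc_map(records: Iterable[FastqRecord],
           min_mean_q: float = 20.0) -> Iterator[FastqRecord]:
    """Map-style filter: keep reads whose mean quality meets the threshold.

    Each read is judged on its own, so separate input splits can be
    processed in parallel without coordination.
    """
    for read_id, bases, quals in records:
        if mean_phred(quals) >= min_mean_q:
            yield read_id, bases, quals


if __name__ == "__main__":
    sample = [
        "@read1", "ACGT", "+", "IIII",   # high-quality read
        "@read2", "ACGT", "+", "!!!!",   # low-quality read
    ]
    for rec in qc_map(parse_fastq(sample)):
        print(rec)
```

Because the filter keeps no state between reads, the same function could serve as the body of a streaming mapper that reads records from standard input; the later stages (alignment, SNP extraction, report generation) would consume its output.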

