Outlier Analysis for Gene Expression Data

期刊名字：计算机科学技术学报
文件大小：354kb
论文作者：Chao Yan，Guo-Liang Chen，Yi-Fei
作者单位：National High Performance Computational Center
更新时间：2020-11-22
下载次数：次

论文简介

Jan. 2001, Vol.19, No.1, pp.13- -21J. Comput. Sci. & Technol.Outlier Analysis for Gene Expression DataChao Yan, Guo Liang Chen, and Yi-Fei ShenTP3 ANational High Performance Computational Center, Universily of Science and Technology of China, Hefei 230027P.R. ChinaE- mail: hyperyan@sina.comReceived May 26, 2003; revised August 13, 2003.Abstract The rapid developments of Lechnologies that geucrate arrays of gene data cnable a global view ofthe transcription levels of hundreds of thousands of genes simulaneously The outlier dlelection problem for genedata has its importance but tougethor with the difculty of high dimensionality. The sparsity of data iu bhigh-dinensional space makes each point a relatively good outlicr in the view of traditional dist ance- based definitions.Thus, finding outliers in high dimensional data is Imore complex. In this paper, some basic outlicr aualysisalgorithrns are discussed and a new genetic algorithmu is presented. This algorithm is to find best diuensionprojections based on a revised cell-based algorithrn and to give explanatious to solutions. It can solve the outlierdetection probleun for gene expression data and for other high dimensional data as well.Keywords gene expression data, outlier analysis, cell-based algorithm, genetic algorithmIntroductionThey should not be near-complermentary to otherRNAs that may be highly abundant in the sample.1.1 Biological BackgroundsThe Oligonucleotide Fingerprinting (ONF)method preceded the other Lwo. 'The ONF methodGene data are extracted from germ plasm is based on spoting the cDNAs on high-density ny-I living creatures. Raw dlata include DNA sC~ lon memnbranes. Large quantities of short synthcticquences and protein squences. Such data are oligocs about 7比12 bases long are radioactivelysuitable for some application such as alignment labeled and put in touch with the membrane inand coluparison. Yet for some other applications proper positions. The oligoes hybridize lo thosesuch as clustering and outlier detection, charac- cDNAs (hat contains a DNA sequence complemen-teristics other than raw data are nceded. Noveltary to that of the oligo. The hybridization infor-technologies of cDNA microarraysl- 则， oligonw- mation is rather noisy because of the short oligvescleotide microarrays:5- 7 and the Oligonucleotide used, but it can be compensatcd by using a longerFingerprinting8- 4 pernit large-scale screenings fingerprint. The ONF method is less eficient thanfor patterns of gene expression and give quantitive the other two, but it has the advantage of applica-readouts on expresion level of thousands of genes. bility to species with unknown genomes.[15] gives a description of more details about theseTypically these technologies gencrate finger-technologies.print data -a vector of rcal values for each genc.cDNA microarrays are microscopic arrays co1- Sorme applications need similarity data or dissi-taining large sets of cDNA sequences immobilized milarity data, which can be computed from fin-on a solid substrate. May gene-specific cDNAs gerprint data. The similarity or dissimilarity dataare spotted on a single matrix. The matrix is then represent pairwise similarity or dissimnilarity. Forprobed with fuoresccntly tagged cDNA represen- example, some clustering algorithrns (e.g., [16. 17])tations of total RNA pools from test and reference need pairwise distances as input.cells. In a square centimeter array's with over tenthousand cDNAs can be generated.In oligonucleotide microarrays, each spot on1.2 Outlier Analysisthe array countains a short synthetic oligonucleotideDM (Data Mining) is an essential process where(oligo) about 20 to 30 bases long. The oligoes are intelligent mnethods,applied in order to ex-designed based on the knowledge of the DNA or tract data patternsIt is cosidered a stepEST target sequences to ensure high afiuity and in the knowlcdge discovery process. Knowledgespecificity of each oligo to a particular target gene.discovery tasks can be classificd into four generalThis work was supported by the National High-Tech Development 863 Program of China under the GrantNo.200AAl1 1041 2002AA104560, and igh-Standard Universitics Construction Project in CAS.中国煤化工MYHCNMHGJ. Compul. Sci. & Technol, Jan. 2004, Vol. 19, No.1categoriesi19|: dependency detection, class identifi-as longer scquences can be put into test. Our test-cation, class description and exception/outlier de-ing gene data has the dimensionalities of 17 and 96.tection. Apparently the first three deal with pat-In some case there may be hundreds of dimensions.terns suitable for most objects in databases whileAn outlier detection algorithm for gene expressionthe last corresponds to outlier analysis.dala must. have the ability to deal with high dimen-What is an outlier? Hawkins' definition{20 cap-sionality.tures the spirit:“An outlier is al observation thatThis paper is orgarized as follows. Section 2deviates so much from other observations as to reviews current outlier detection algorithms. Inarouse suspicions that it was generated by a differ-Section 3 we introduce our genctic algorithn forent mechanism." Outliers are sormetimes nontriv-high-dimensional data outlier detection. In Sectionial in spite of their sparsity. Exceptional behaviors 4 the algorithm is tested on gene expression datamay innply more valuable information than regularand results and discussions are given. Section 5 isoncs. Ten regular ones only imply one regular rulethe conclusion.but yct ten exceptional ones may imply ten differ-ent and potential systemn defections. F inding an2 Related Workoutlying financial record may reveal fraud crime.Outliers in gene cxpression data may also be crt-Althoughi outlier detection is a relatively ncwcial. A gene exception may yield 1) no harm and noarea in data mining, there have been quite a lot ofbenefit or 2) a harmful genetic defection (such aseficient algorithms for it.henophilin) or 3} improvements (such as immunilyStatistical-based outlier detection!28} assumesto a certain disease). The last two are 1o doubt ofa distribution or probability model for the givenmost biological importance.Outlier analysis is similar to clustering whichdatasct and views those as outliers which dlevi-finds chusters containing similar records. Althoughate severcly from the distribution. Those outliersmay be generated by another distribution, thus rC-some clustering algorithms (e.g-, [21- -27]) can bevealing another mechanism. This method has aapplied lo outlier detection, they are actually in-drawback that a presumed distribution is needed.sensitive to outliers as they are mainly for clusters.Yet in many cases, data mining deals with thoseTheir results of outlier analysis are often inaccu-data whose internal mechanism is still unkuown torate. Outlier analysis needs its own algorithrns.people. Another drawback is it can hardly dealThere have been some algoithunsl19,28- -34] forwith multi-dimcnsional data. Analysis of multi-outlicr analysis in recent years. Yet these algo-dimensional distribution is quite complex.rithuns are all vulnerable to high dimensionality.Density- based outlier detection29,30 is basedAggarwal pointed out in [35] the desiderata for high on distancebased methods. The density is decideddinensional outlier detection algorithms as follows.upon distances and the number of points in a cer-●They should devise techniques to efectivelytain arca. [29, 30] also define“local outliers" Co1Il-handle the sparsity problems in high dimensionalparcd with“global outliers". Aggarwal35l givesan outlier detection algorithm for high dimensionalo They should provide interpretability in termsdata by combining the concepts of density and sub-of the rcasoning that creates the abnormality.space projection.●Proper measures must be identifed in orderDeviation -based outlier detection{31-33] exanl-to account for the plysical significance of the defi-ines the main characteristics of objects and de-nition of an outlier in k-dimensional subspace.termines outliers. Arning31] uscs the sequential●They should continue to be computationally technique for detection. Sarawagil32] introduces aefficient for very high dimensional problems.discovery-driven method for identifying exceptions .●They shonld provide importance to the localin data using OLAP data cubes. Jagadish') givesdata behavior while detcrmining whether a point isan efficient method for mining deviants in timcan outlier.series databases.Gene expression data are inherently linked toDistance- based outlier detection[19,34] views thehigh dimensionality. Biological experiments oftendata as points in high-dimension space and outlierspinpoint quite a large quantity of segments ina se- are those far away from other points. This methodquence, thus leading to the inevitability of highdi- is independent of a certain distribution and canmensional data. Additionally, more advanced tech-deal with Imulti dimensional data. Yet for high-nologies may yield data with higher dimensionalty dimensional data, the sparsity deprives distances中国煤化工MYHCNMHGChao Yan et al: Outlier Analysis for Gene Expression Data15of their meanings. The distances from a certain(b) If there are > M objects in Cx,gULr(x,y),point to all other points varies little. A solutionisBlnone of the objects in Cx,u is an outlier;is the projection of high-dimensional data to sub-(c) If there are≤M objects in Cxy∪Lr(x.y)∪spaces. Some algorithmsl2731 use this mcthod for L2(x, y), etvery object in Cxr.y is an outlier.clustering. Aggarwal in [35] uses it for outlier de-Fig.1 represents a cell-based algorithm for out-A distance- based algorithn can be described aslier detection, where m represents the total numberof cells.DB(p, D), where an object O in a dataset T is anFor a k-dimensioual space, the length I of a di-outlier if at least fraction pof the objectsin T lies agonal of a k-dimensional cell is D)/2√k. The com-greater than distance. A most efficient distance-plexilty o[ the algorihru is O(ck+ N), wlere c isbased algorithm is the cell-based algorithm[49] thatsome constant depending on Vk and m//k, roughlyavoids computing distances between each two ob-the lnumber of cells along each dimension. .jects. A 2-dimnensional space is partitioned intocells of length l = D/2vk. Let Cxu denote the3 A Genetic Algorithm for Outlier Detec-cell that is at the intersection of row c and columny. The layer 1 (L1) neighbors of Cxny and L2 are tiondefined as:L(Cx,y)= {Cu,olx-1≤u≤x+ 1,y-1≤3.1 Overview of Genetic Algorithmv≤y+ 1,Cu.o≠Cxy}Genetic algorithms or evolutionary algori-L2(Cx.g) = {Ciu,v|xc-3≤u≤x+3,y-3 dis-is impractical due to the computing conplexity.tance D apart.Let N be the number of objects in dataset T,Alternatively some approximate solutions can wokwell. Each solution is an individual, typically inM=N(1-p)the form of a string. A fitress funclion evaluatesProperty 4.(a) If there ure > M objects in Cr,y; none ofthe fitness of an individual. Individuals in the pop-ulation are matched for crossover to generate off-the objects in Cx,y is an outlier;spring, Mutations occur to the offspring at a cer-tain possibility. If the offspring is quite good, it1. For q= 1.2...m. Countuy = 0.. For each object P; map P to a proper cill Cg.ay replace worst oncs in the population. The eVo-store P and increase Countg by 1.lution process continues until the individunals in the. Forq= 1,2...m,. if Counta > M, label Cq red.best set are more and more genetically sinmilar to4. For each red cell Cr, label each of the L neigthbors ofeach other. De Jong in [40] defincs the convergenceCr pink, provided the neighbor has not already beenlabeled red.as 95% of the population has the sane value. Then. For each non-empty white (uncolored) cell Cw, do:best individuals can be selected as soluitions.u) Conuntwa = Countw + S:∈1.1 (Cu) Counlib) If Countw2 > M, label Cw pinkElse3.2 Descriptions of Our Algorithm1) Counlu3 = Countu2 + Sie L2(Cu) Counti2) If Conuntwra ≤M, mark all objccts in Cw as ouliersWe will now discuss our genetic algorithm foroutlier analysis in higl-dimensional spacc. LetFor cach object P E Cuw, do:A = AlxA2X... x An be a d-dirnensional nu-1) Counlp = Countw22) For ceach object Q∈L2(Cw), if dist(P,Q)≤D,merical space. The input data are N vectorsincrcase Counlp by IVi = (1/,172...i.). i = 1,2...N.Con-) if Countp≤M, mark P as an outlier.sider a projection of datasct to a subspace CFig.1. Pseudo-code of a ell-based algorithm for the dimen- Atl x Ar2X...X Atk, wherek.70 -81.38] Holland J H. Adaptation in natural and artificial sys-Guo-Liang Chen is a professor, academician oftems. Universily of Michigan Press, Ann Arbor MI the Chinese Academy of Sciences. He works with the1975.Dept. of Computer Sci. & Tech, University of Science[39] Chen G L. Genetic algorithm and its applications. Neand Technology of China. His major Research areas0] De Jong K A. Analysis of the behavior of a class oftional Defense Press, 1996. (in Chinese)include parallel theory and algorithn.genctic adaptive systems| Dissertation]. University ofYi-Fei Shen is a Ph.D. candidate of Dept, of Com-Michigan, Ann Arbor, MI, 1975[41] Aggarwal C C, OrlinJ B, Tai R P. Optimized crossoverputer Sci. & Tech., University of Science and Technol-for the independ此problem. Operations Research,ogy of China. His major intercsts include bioinformat-1997, 45(2). "ics and algorithm.中国煤化工MYHCNMHG

论文截图