Classification analysis of microarray data based on ontological engineering

Journal: Journal of Zhejiang University A (English edition)
File size: 427 KB
Authors: LI Guo-qi, SHENG Huan-ye
Affiliation: Department of Computer Science and Engineering
Last updated: 2020-12-06

Paper summary

Li et al. / J Zhejiang Univ Sci A 2007 8(4):638-643
Journal of Zhejiang University SCIENCE A
ISSN 1673-565X (Print); ISSN 1862-1775 (Online)
www.zju.edu.cn/jzus; www.springerlink.com
E-mail: jzus@zju.edu.cn

LI Guo-qi, SHENG Huan-ye
(Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)
E-mail: liguoqi@sjtu.edu.cn
Received Feb. 11, 2006; revision accepted Sept. 21, 2006

Abstract: Background knowledge is important for data mining, especially in complicated situations. Ontological engineering is the successor of knowledge engineering. The sharable knowledge bases built on ontology can be used to provide background knowledge to direct the process of data mining. This paper gives a general introduction to the method and presents a practical analysis example using SVM (support vector machine) as the classifier. Gene Ontology and the accompanying annotations compose a big knowledge base, on which much research has been carried out. A microarray dataset is the output of a DNA chip. With the help of Gene Ontology we present a more elaborate analysis of microarray data than former researchers. The method can also be used in other fields with similar scenarios.

Key words: Ontological engineering, Data mining, Microarray, Support vector machine (SVM)
doi: 10.1631/jzus.2007.A0638    Document code: A    CLC number: TP391

* Project (No. 20040248001) supported by the Ph.D. Programs Foundation of Ministry of Education of China

INTRODUCTION

There is a Chinese saying that "Lessons learned from the past can guide one in the future". With the accumulation of knowledge, data mining is no longer an isolated mission. What we already know should be made clear before we explore new knowledge. Information technology has been collaborating with traditional industries extensively and deeply, so data mining needs prior understanding of domain-specific knowledge. The sharable knowledge bases built on ontology can be used to provide background knowledge automatically to direct the process of data mining. In the field of bioinformatics, Gene Ontology and its annotations compose a big knowledge base, on which much research has been carried out. A microarray dataset is the output of a DNA chip. With the assistance of Gene Ontology we present a more elaborate analysis of microarray data than former studies. In this section, we first present a brief introduction to ontology, ontological engineering and its application in biology, Gene Ontology. Then the background of microarray gene expression data analysis is described. Finally, a data mining experiment on microarray data based on Gene Ontology is designed; it is described in detail in the rest of the paper.

Ontological engineering and Gene Ontology

From the computational point of view, ontologies are agreements on shared conceptualizations. Shared conceptualizations include conceptual frameworks for modeling domain knowledge, content-specific protocols for communication among inter-operating agents, and agreements on the representation of particular domain theories (Gruber, 1995). An ontology is used as an explication of knowledge or as an organizer of metadata. There are two large differences between the roles of an ontology for knowledge bases and those for metadata: one is philosophical and the other is practical. The philosophical one is that an ontology, for knowledge bases, explicates the conceptualization of the target world; the practical one is that an ontology, for metadata, is a set of computer-understandable vocabulary (Mizoguchi, 2003). In the first kind of case, the ontology works as a system of fundamental concepts, that is, a protocol specification of any knowledge base, explicating the conceptualization of the target world and providing a solid foundation on which one can build sharable knowledge bases with wider usability than that of a conventional knowledge base. In the other kind of case, the ontology is used as a tool for data retrieval and exchange between heterogeneous databases.

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism (Ashburner et al., 2000). It is organized into three categories, that is, three ontologies: biological process, cellular component and molecular function. Nearly 166k genes have been annotated with Gene Ontology. An ontology comprises a set of well-defined terms with well-defined relationships. The structure itself reflects the current representation of biological knowledge and also serves as a guide for organizing new data. Data can be annotated to varying levels depending on the amount and completeness of available information (Ashburner et al., 2000). The original intent of the Gene Ontology project was to construct a set of vocabularies comprising terms that could be shared with a common understanding of the meaning of any term used, and that could support cross-database queries. It soon became obvious that the combined set of annotations from the model organism groups would provide a useful resource for the entire scientific community. Therefore, in addition to developing the shared structured vocabularies, the Gene Ontology project is developing a database resource that provides access not only to the vocabularies, but also to annotation and query applications and to specialized datasets resulting from the use of the vocabularies in the annotation of genes and/or gene products (The Gene Ontology Consortium, 2001). It has become a large knowledge base for molecular biology research. As described above, Gene Ontology is not a knowledge base itself, but the databases annotated with Gene Ontology are knowledge bases. The technique of ontology makes it an open, specified and computer-understandable knowledge base which minimizes the gap between computational and biological researchers.

Ontological engineering seems to be more developed in biology than in any other field, except perhaps the semantic web, because of the characteristics of biological research itself. Gene and protein sequence databases hold voluminous data and continue to more than double in size every year (Roos, 2001; Benson et al., 2006). Facing the huge number of entities and their relationships, biologists resorted to an advanced knowledge management method, ontological engineering, to provide services. We therefore chose a bioinformatics research problem as our example. The method should also be sound in other fields with similar scenarios.

Introduction of microarray gene expression data and their analysis

With the rapid development of genome-scale sequencing, many genomes are already known. An essential and formidable task is to define the role of each gene and to understand how the genome functions as a whole. As we know, at the molecular level, DNA is transcribed into messenger ribonucleic acid (mRNA) and then mRNA is translated to produce a protein. In fact, only a small part of the genome codes for genes. Most of the functional roles of the other segments are still unknown. It is impossible to uncover the mystery just by analyzing the sequence data. To study the relationships of molecular entities at the system level, DNA chips were introduced that can simultaneously measure the expression levels of thousands of genes in a cell (Allison, 2005). Microarray data are the output of DNA chips. In a microarray dataset, each column represents a gene and each row is a "frozen picture" of the genes' expression levels, taken across a series of consecutive conditions or across samples with different characteristics. These microarray datasets typically have a large number of columns but a small number of rows. For example, many gene expression datasets may contain up to 10000~100000 columns but only 100~1000 rows (Brown et al., 2000), because the number of genes is big and that of cell samples is usually limited.

Initial experiments (Eisen et al., 1998) suggest that genes of similar function yield similar expression patterns in microarray experiments. As data from such experiments accumulate, it will be essential to have accurate means for extracting biological significance and using the data to assign functions to genes (Brown et al., 2000). Former research in this field can be divided roughly into three categories. The earliest method is cluster analysis, which is unsupervised: it learns in the absence of a teacher signal. Methods of this kind begin with a definition of similarity, or a measure of distance between expression patterns, but with no prior knowledge of the true functional classes of the genes. Genes are then grouped by using a clustering algorithm such as hierarchical clustering (Eisen et al., 1998) or self-organizing maps (Tamayo et al., 1999). But the limitation of cluster analysis is obvious. Different kinds of analysis scenarios have to share the same similarity metric, such as correlation or Euclidean distance. However, the choice of an appropriate distance metric is critical in order to reveal the true underlying expression patterns beneath the samples (Phan et al., 2004). So researchers resorted to supervised methods, such as SVM (support vector machine) (Brown et al., 2000) or other statistical methods, to achieve more sensitive analysis. Supervised learning techniques use a training set to specify in advance which data should cluster together. As applied to gene expression data, a set of genes that have a common function and a separate set of genes that are known not to be members of the functional class are specified. These two sets of genes are combined to form a set of training examples in which the genes are labelled positively if they are in the functional class and negatively if they are known not to be in the functional class. Using this training set, a classifier learns to discriminate between the members and non-members of a given functional class based on expression data. Having learned the expression features of the class, the classifier can recognize new genes as members or as non-members of the class based on their expression data.

In addition to the clustering and classification methods, analyses based on Gene Ontology are also used to explore microarray datasets (Pavlidis et al., 2004). Kennedy et al. (2004) presented a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. In essence, the method searches the background knowledge contained in the Gene Ontology and the annotation database to give an organized conclusion about the genes in the microarray in a cluster fashion.

Other research problems have also been tackled successfully with the aid of Gene Ontology, such as protein subcellular location (Chou and Cai, 2003). But its most active use is in the analysis of microarray data. The reason lies in the structure of microarray data. Microarray datasets typically have a large number of columns but a small number of rows. So the relationships of entities in the analysis are more important than those in analyses of traditional datasets. The problem then is an urgent shortage of background knowledge. The ontological method is popular and effective in microarray data analysis. Unlike most Gene Ontology research, which typically uses artificial intelligence methods such as reasoning and deduction, we pay attention to how ontological engineering provides background knowledge and directs the process of data mining.

Finding the unknown function of genes in a microarray dataset with data mining

Genes of similar function are known to yield similar expression patterns in microarray experiments, and supervised learning techniques have been successfully used in the analysis of microarray datasets. So we can now utilize the background knowledge in the Gene Ontology, mainly the biological process component, and its annotations for deeper and more accurate research into microarray datasets. The reason why the biological process ontology was selected lies in the character of the test dataset, which is described in detail in Section 2. We first look up the biological process of every gene in the Gene Ontology and draw the subgraph of the ontology that contains only the nodes related to the genes concerned. The subgraph can be seen as a taxonomic tree organized by the biological roles of the genes. Every node is annotated with a biological process name and the set of genes with that function. From this global background knowledge we can construct a set of classifiers to predict unknown functions of genes in the microarray dataset at different levels.

DATASET AND ITS BACKGROUND KNOWLEDGE IN GENE ONTOLOGY

To illustrate our method, we use a microarray dataset that has been used in many similar studies. The dataset has 79-element gene expression vectors for 2467 yeast genes. The data were generated from spotted arrays using samples collected at various time points during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks (http://rana.stanford.edu/clustering) (Brown et al., 2000).

Before data analysis, normalization is an important step with microarray data. Conveniently, there are standard algorithms and software for solving the problem. Cluster3, an open source tool (available at http://rana.lbl.gov/EisenSoftware.htm), can be used to normalize microarray data. The algorithms are described in detail in the documentation attached to the software.

After normalization, we input the genes into Gene Ontology and its annotation database and obtained the subgraph of the ontology that contains only the nodes related to the genes of interest. The information can also be represented as a gene-category matrix. In the matrix, each column stands for a microarray gene that has also been annotated in the biological process category of Gene Ontology, and each row represents a function term in the hierarchy of Gene Ontology. We ignore the relationships between the nodes and view the result as a taxonomic tree with respect to molecular function. The ignored information will be reconsidered in the biological analysis in later research. Part of the subgraph is illustrated in Fig.1. We select the four nodes marked with an asterisk to carry out our analysis. In fact, only the four nodes and their upper level nodes are shown in Fig.1.

[Fig.1 Part of the subgraph of the obtained Gene Ontology. The four nodes marked with an asterisk are selected to carry out our analysis. Nodes shown: Biological process; Physiological process; Metabolism; Cellular metabolism; Heterocycle metabolism*; Generation of precursor metabolites and energy; Energy derivation by oxidation of organic compounds; Cellular respiration*; Cellular physiological process; Cell division; Cell budding*; Cellular process; Reproduction; Asexual reproduction; Development; Regulation of gene expression, epigenetic*; Regulation of biological process]

CLASSIFICATION STRATEGIES

Four biological processes were selected. The biological process names and the number of genes annotated to each node are shown in Table 1. Every two classes are assembled as a group, and six independent binary classifiers were assigned, one to each group. We divided the samples into two parts, one used for training the classifier and the other used for testing the accuracy of the corresponding classifier. Then we can answer such questions as whether a gene participates in biological process ID 1 or ID 2, given that it is known to participate in one of them.

Table 1 Categories and number of samples
Category ID   Process name                                Number of samples
1             Cellular respiration                        53
2             Heterocycle metabolism                      50
3             Regulation of gene expression, epigenetic   […]
4             Cell budding                                63

SVM (Burges, 1998) is selected as the classifier, as it has proven to have high performance in the classification of microarray data (Brown et al., 2000). In practice, we used LIBSVM, which is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM) (Chang and Lin, 2001).

Before training the SVM, we must decide which kernel to use and then choose the penalty parameter C and the kernel parameters. The RBF kernel $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ is known to have many advantages (Burges, 1998; Chang and Lin, 2001) and is suitable for microarray data analysis (Brown et al., 2000). After selection of the kernel, five-fold cross-validation was used to find the best C and γ. Finally, the best parameters were used to train and test the classifiers by five-fold cross-validation. The result is shown in Table 2.

Table 2 The best parameters and classification accuracy with five-fold cross-validation
Classifier   Best C   Best γ            Average accuracy (%)
1-2          […]      0.125             81.3008
1-3          3.0      […]               91.0569
1-4          32.0     0.0078125         84.1270
[…]          32.0     0.0001220703125   70.8333
[…]          8.0      0.03125           […]
[…]          2.0      […]               74.7967

RESULT

The result shows that microarray data can be used to predict the function of genes in biological processes. The data in Table 2 show that the classifiers involving category ID 1 have high accuracy. Because the biochemical experiment was carried out over a period of time with temperature changes, the regulation of cellular respiration shows up more notably. The accuracy of classifier 1-3 is higher than that of 1-2 or 1-4 because "cellular respiration" and "regulation of gene expression, epigenetic" are less related than the other two pairs. The result is reasonable because it can be explained by biological mechanism (Brown et al., 2000).

CONCLUSION

Data mining based on ontological engineering has many advantages, and there have been many successful cases with the method. This paper gives a general introduction to it and presents a practical analysis example using SVM (support vector machine) as the classifier. Our research mainly focused on microarray data analysis, as the application of ontological engineering is relatively mature in that field. The method can also be used in other fields with similar scenarios. With the accumulation of knowledge and data, the application of ontological engineering and data mining based on it will become more popular. We will focus our future research on this field.

References

Allison, D.B., 2005. DNA Microarrays and Related Genomic Techniques: Statistical Design, Analysis, and Interpretation of Experiments. Chapman & Hall/CRC, p.5-9.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al., 2000. Gene Ontology: tool for the unification of biology. Nat. Genet., 25:25-29. [doi:10.1038/75556]
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L., 2006. GenBank. Nucleic Acids Research, 34(Database issue):16-20. [doi:10.1093/nar/gkj157]
Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares, M.Jr, Haussler, D., 2000. Knowledge-based analysis of microarray gene expression data using support vector machines. PNAS, 97(1):262-267. [doi:10.1073/pnas.97.1.262]
Burges, C., 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167. [doi:10.1023/A:1009715923555]
Chang, C.C., Lin, C.J., 2001. LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chou, K.C., Cai, Y.D., 2003. A new hybrid approach to predict subcellular localization of proteins by incorporating Gene Ontology. Biochem. Biophys. Research Commun., 311(3):743-747. [doi:10.1016/j.bbrc.2003.10.062]
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25):14863-14868. [doi:10.1073/pnas.95.25.14863]
Gruber, T.R., 1995. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Human-Computer Studies, 43(5-6):907-928. [doi:10.1006/ijhc.1995.1081]
Kennedy, P.J., Simoff, S.J., Skillicorn, D.B., Catchpoole, D., 2004. Extracting and Explaining Biological Knowledge in Microarray Data. Pacific-Asia Conference on Knowledge Discovery and Data Mining, p.699-703.
Mizoguchi, R., 2003. Tutorial on ontological engineering. Part 1: introduction to ontological engineering. New Generation Computing, 21(4):365-384.
Pavlidis, P., Qin, J., Arango, V., Mann, J., Sibille, E., 2004. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochemical Research, 29(6):1213-1222. [doi:10.1023/B:NERE.0000023608.29741.45]
Phan, J.H., Quo, C.F., Guo, K.J., Feng, W.M., Wang, G., Wang, M.D., 2004. […] Knowledge-based Multi-scheme Cancer Microarray Data Analysis System. Proc. 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04), p.474-475.
Roos, D., 2001. Bioinformatics: trying to swim in a sea of data. Science, 291:1260-1261. [doi:10.1126/science.291.5507.1260]
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., Golub, T., 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. PNAS, 96(6):2907-2912. [doi:10.1073/pnas.96.6.2907]
The Gene Ontology Consortium, 2001. Creating the Gene Ontology resource: design and implementation. Genome Res., 11(8):1425-1433.
[doi:10.1101/gr.180801]
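
The classification strategy described in the paper (one binary classifier per pair of the four categories, an RBF kernel, and a search over C and γ with five-fold cross-validation) can be sketched in plain Python as follows. This is only an illustration of the experimental setup, not the authors' code: it does not train an SVM (the paper used LIBSVM for that), the function names are hypothetical, the exponent ranges of the grid are assumptions suggested by the power-of-two values reported in Table 2, and the rule that assigns samples to folds is an assumption because the paper does not say how the folds were drawn.

```python
import math
from itertools import combinations


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def pairwise_tasks(category_ids):
    """One binary classification task per unordered pair of categories."""
    return list(combinations(category_ids, 2))


def power_of_two_grid(c_exponents, gamma_exponents):
    """Candidate (C, gamma) pairs on a base-2 grid; the exponent ranges are assumptions."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]


def five_fold_splits(n_samples, n_folds=5):
    """Return (train, test) index lists; sample i is tested in fold i % n_folds (assumed rule)."""
    splits = []
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    tasks = pairwise_tasks([1, 2, 3, 4])  # category IDs as in Table 1
    grid = power_of_two_grid(range(-5, 16, 2), range(-15, 4, 2))
    print(len(tasks), "binary tasks:", tasks)
    print(len(grid), "candidate (C, gamma) pairs")
    print("K(x, x) =", rbf_kernel([0.5, -1.0], [0.5, -1.0], gamma=0.125))
```

In a full experiment, each (C, γ) pair from the grid would be scored on every split of a task's samples by an SVM trainer, and the pair with the best mean accuracy would be kept for that task.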
