Welcome to the UPF Digital Repository

An assessment of gene prediction accuracy in large DNA sequences

dc.contributor.author Guigó Serra, Roderic
dc.contributor.author Agarwal, Pankaj
dc.contributor.author Abril Ferrando, Josep Francesc
dc.contributor.author Burset Albareda, Moisès
dc.contributor.author Fickett, James W.
dc.date.accessioned 2012-05-21T09:44:56Z
dc.date.available 2012-05-21T09:44:56Z
dc.date.issued 2000
dc.identifier.citation Guigó R, Agarwal P, Abril JF, Burset M, Fickett JW. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 2000;10(10):1631-42. DOI: 10.1101/gr.122800
dc.identifier.issn 1088-9051
dc.identifier.uri http://hdl.handle.net/10230/16471
dc.description.abstract One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200 kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
dc.format.mimetype application/pdf
dc.language.iso eng
dc.publisher Cold Spring Harbor Laboratory Press-CSHL Press
dc.relation.ispartof Genome Research. 2000;10(10):1631-42
dc.rights © 2000 Genome Research by Cold Spring Harbor Laboratory Press. Published version available at http://genome.cshlp.org. This document is subject to a Creative Commons license (Attribution-NonCommercial 3.0 Unported License)
dc.rights.uri http://creativecommons.org/licenses/by-nc/3.0/
dc.subject.other Nucleotide sequences
dc.subject.other Human genetics
dc.title An assessment of gene prediction accuracy in large DNA sequences
dc.type info:eu-repo/semantics/article
dc.identifier.doi http://dx.doi.org/10.1101/gr.122800
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.type.version info:eu-repo/semantics/publishedVersion

