Calibrating seed-based heuristics to map short reads with sesame

dc.contributor.author: Filion, Guillaume
dc.contributor.author: Cortini, Ruggero
dc.contributor.author: Zorita, Eduard
dc.date.accessioned: 2020-10-20T06:02:00Z
dc.date.available: 2020-10-20T06:02:00Z
dc.date.issued: 2020
dc.description.abstract: The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.
dc.description.sponsorship: We acknowledge the financial support of the Spanish Ministry of Economy, Industry and Competitiveness (Centro de Excelencia Severo Ochoa 2013–2017, Plan Estatal PGC2018-099807-B-I00), of the CERCA Programme/Generalitat de Catalunya, and of the European Research Council (Synergy Grant 609989). RC was supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement 608959. We also acknowledge support of the Spanish Ministry of Economy and Competitiveness (MEIC) to the EMBL partnership.
dc.format.mimetype: application/pdf
dc.identifier.citation: Filion GJ, Cortini R, Zorita E. Calibrating seed-based heuristics to map short reads with sesame. Front Genet. 2020; 11:572. DOI: 10.3389/fgene.2020.00572
dc.identifier.doi: http://dx.doi.org/10.3389/fgene.2020.00572
dc.identifier.issn: 1664-8021
dc.identifier.uri: http://hdl.handle.net/10230/45518
dc.language.iso: eng
dc.publisher: Frontiers
dc.relation.ispartof: Front Genet. 2020; 11:572
dc.relation.projectID: info:eu-repo/grantAgreement/EC/FP7/609989
dc.relation.projectID: info:eu-repo/grantAgreement/EC/FP7/608959
dc.rights: © 2020 Filion, Cortini and Zorita. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject.keyword: C library
dc.subject.keyword: Analytic combinatorics
dc.subject.keyword: Heuristic algorithms
dc.subject.keyword: Probability
dc.subject.keyword: Seeding accuracy
dc.title: Calibrating seed-based heuristics to map short reads with sesame
dc.type: info:eu-repo/semantics/article
dc.type.version: info:eu-repo/semantics/publishedVersion

Files

Original bundle

Name: Filion_fg_cali.pdf
Size: 1.92 MB
Format: Adobe Portable Document Format