Large multiple sequence alignments with a root-to-leaf regressive method


  • dc.contributor.author Garriga, Edgar
  • dc.contributor.author Di Tommaso, Paolo
  • dc.contributor.author Magis, Cedrik
  • dc.contributor.author Erb, Ionas
  • dc.contributor.author Mansouri, Leila
  • dc.contributor.author Baltzis, Athanasios
  • dc.contributor.author Laayouni, Hafid, 1968-
  • dc.contributor.author Kondrashov, Fyodor A., 1979-
  • dc.contributor.author Floden, Evan
  • dc.contributor.author Notredame, Cedric
  • dc.date.accessioned 2025-01-29T13:36:25Z
  • dc.date.available 2025-01-29T13:36:25Z
  • dc.date.issued 2019
  • dc.description.abstract Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
  • dc.description.sponsorship This project was supported by the Centre for Genomic Regulation, the Spanish Plan Nacional, the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa’ (E.G., P.T., C.M., I.E., L.M., A.B., F.K., E.F. and C.N.) and an ERC Consolidator Grant from the European Commission, grant agreement no. 771209 ChrFL (F.K.).
  • dc.format.mimetype application/pdf
  • dc.identifier.citation Garriga E, Di Tommaso P, Magis C, Erb I, Mansouri L, Baltzis A, et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nat Biotechnol. 2019 Dec;37(12):1466-70. DOI: 10.1038/s41587-019-0333-6
  • dc.identifier.doi http://dx.doi.org/10.1038/s41587-019-0333-6
  • dc.identifier.issn 1087-0156
  • dc.identifier.uri http://hdl.handle.net/10230/69357
  • dc.language.iso eng
  • dc.publisher Nature Research
  • dc.relation.ispartof Nat Biotechnol. 2019 Dec;37(12):1466-70
  • dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/771209
  • dc.rights © Springer Nature Publishing AG Garriga E, Di Tommaso P, Magis C, Erb I, Mansouri L, Baltzis A, et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nat Biotechnol. 2019 Dec;37(12):1466-70. DOI: 10.1038/s41587-019-0333-6 [http://dx.doi.org/10.1038/s41587-019-0333-6]
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.subject.keyword Computational models
  • dc.subject.keyword Phylogenetics
  • dc.title Large multiple sequence alignments with a root-to-leaf regressive method
  • dc.type info:eu-repo/semantics/article
  • dc.type.version info:eu-repo/semantics/acceptedVersion
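
The abstract describes a root-to-leaf, divide-and-conquer traversal of a guide tree. The sketch below illustrates only the "divide" step under stated assumptions and is not the authors' implementation: the guide tree is assumed binary, each subtree is assumed to be represented by its longest sequence, and the names `Node`, `representative`, `buckets` and the bucket size `n` are hypothetical. Each yielded bucket is a group of at most `n` sequences that would be handed to a third-party aligner. The root bucket comes first, so the most divergent groups are aligned before their close relatives.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Guide-tree node; leaves carry a sequence, internal nodes carry children."""
    name: str
    seq: str = ""
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> Iterator[Node]:
    """Yield the leaves below a node, left to right."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def representative(node: Node) -> Node:
    """Assumption: a subtree is represented by its longest leaf sequence."""
    return max(leaves(node), key=lambda leaf: len(leaf.seq))


def buckets(root: Node, n: int) -> Iterator[List[str]]:
    """Yield groups of at most n representative sequence names, root first."""
    if n < 2:
        raise ValueError("bucket size n must be at least 2")
    # Expand internal nodes below the root until n subtrees are collected
    # or only leaves remain (a binary tree grows the frontier by one per step).
    frontier = [root]
    while len(frontier) < n and any(x.children for x in frontier):
        i = next(i for i, x in enumerate(frontier) if x.children)
        frontier[i:i + 1] = frontier[i].children
    yield [representative(x).name for x in frontier]
    # Each unresolved subtree becomes its own smaller problem. Its bucket
    # contains the subtree's representative again, which is the shared
    # sequence a later merge step could use to combine sub-alignments.
    for x in frontier:
        if x.children:
            yield from buckets(x, n)
```

Because every call expands at least one internal node and no internal node is expanded twice, the number of buckets in this sketch is bounded by the number of internal nodes of the guide tree, so the number of aligner calls grows linearly with the number of sequences for a fixed `n`.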