Scalable workflows and reproducible data analysis for genomics

Strozzi, Francesco; Janssen, Roel; Wurmus, Ricardo; Crusoe, Michael R.; Githinji, George; Di Tommaso, Paolo; Belhachemi, Dominique; Möller, Steffen; Smant, Geert; de Ligt, Joep; Prins, Pjotr

dc.contributor.author Strozzi, Francesco
dc.contributor.author Janssen, Roel
dc.contributor.author Wurmus, Ricardo
dc.contributor.author Crusoe, Michael R.
dc.contributor.author Githinji, George
dc.contributor.author Di Tommaso, Paolo
dc.contributor.author Belhachemi, Dominique
dc.contributor.author Möller, Steffen
dc.contributor.author Smant, Geert
dc.contributor.author de Ligt, Joep
dc.contributor.author Prins, Pjotr
dc.date.accessioned 2022-05-17T10:57:54Z
dc.date.available 2022-05-17T10:57:54Z
dc.date.issued 2019
dc.description.abstract Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large data volumes involved, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer. In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines, each of which can be run in parallel: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow. We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions represent the large majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also bring us closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
dc.format.mimetype application/pdf
dc.identifier.citation Strozzi F, Janssen R, Wurmus R, Crusoe MR, Githinji G, Di Tommaso P et al. Scalable workflows and reproducible data analysis for genomics. Methods Mol Biol. 2019;1910:723-745. DOI:10.1007/978-1-4939-9074-0_24
dc.identifier.doi http://dx.doi.org/10.1007/978-1-4939-9074-0_24
dc.identifier.issn 1940-6029
dc.identifier.uri http://hdl.handle.net/10230/53117
dc.language.iso eng
dc.publisher Springer
dc.rights © Francesco Strozzi et al. 2019. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
dc.rights.accessRights info:eu-repo/semantics/openAccess
dc.rights.uri http://creativecommons.org/licenses/by/4.0/
dc.subject.other Genomics
dc.subject.other Bioinformatics
dc.title Scalable workflows and reproducible data analysis for genomics
dc.type info:eu-repo/semantics/article
dc.type.version info:eu-repo/semantics/publishedVersion

Collections

Articles (Center for Genomic Regulation (CRG))