Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD
- dc.contributor.author Marcon, Yannick
- dc.contributor.author Bishop, Tom
- dc.contributor.author Avraam, Demetris
- dc.contributor.author Escriba-Montagut, Xavier
- dc.contributor.author Ryser-Welch, Patricia
- dc.contributor.author Wheater, Stuart
- dc.contributor.author Burton, Paul
- dc.contributor.author González, Juan Ramón
- dc.date.accessioned 2022-05-31T07:00:29Z
- dc.date.available 2022-05-31T07:00:29Z
- dc.date.issued 2021
- dc.description.abstract Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers' ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. 
To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
- dc.description.sponsorship This research has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 874583 (ATHLETE) and No 824989 (EUCAN-Connect); the Ministerio de Ciencia, Innovación y Universidades (MICIU), Agencia Estatal de Investigación (AEI) and Fondo Europeo de Desarrollo Regional, UE (RTI2018-100789-B-I00) also through the “Centro de Excelencia Severo Ochoa 2019-2023” Program (CEX2018-000806-S); and the Catalan Government through the CERCA Program. This article is part of the project VEIS: 001-P-001647 co-financed by the European Regional Development Fund of the European Union in the framework of the Operational Program FEDER of Catalonia 2014-2020 with the support of the Secretaria d'Universitats i Recerca del Departament d'Empresa i Coneixement de la Generalitat de Catalunya. This work also forms part of Newcastle University’s methods program in Health Data Science addressing the securing of sensitive data; with support from the Department of Health and Social Care under the Connected Health Cities (North East North Cumbria) project and from joint Wellcome Trust/Medical Research Council grant 108439/A/15/Z. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- dc.format.mimetype application/pdf
- dc.identifier.citation Marcon Y, Bishop T, Avraam D, Escriba-Montagut X, Ryser-Welch P, Wheater S, Burton P, González JR. Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD. PLoS Comput Biol. 2021 Mar 30;17(3):e1008880. DOI: 10.1371/journal.pcbi.1008880
- dc.identifier.doi http://dx.doi.org/10.1371/journal.pcbi.1008880
- dc.identifier.issn 1553-734X
- dc.identifier.uri http://hdl.handle.net/10230/53321
- dc.language.iso eng
- dc.publisher Public Library of Science (PLoS)
- dc.relation.ispartof PLoS Comput Biol. 2021 Mar 30;17(3):e1008880
- dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/874583
- dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/824989
- dc.relation.projectID info:eu-repo/grantAgreement/ES/2PE/RTI2018-100789-B-I00
- dc.rights © 2021 Marcon et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri http://creativecommons.org/licenses/by/4.0/
- dc.subject.keyword Genomics
- dc.subject.keyword Genome analysis
- dc.subject.keyword Computer software
- dc.subject.keyword Data management
- dc.subject.keyword Genome-wide association studies
- dc.subject.keyword Database searching
- dc.subject.keyword Graphical user interfaces
- dc.subject.keyword Metaanalysis
- dc.title Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD
- dc.type info:eu-repo/semantics/article
- dc.type.version info:eu-repo/semantics/publishedVersion