Dealing with noise and biases in omics: statistical methods, deep learning and workflows
Author
Jin Wu, S.
Director
Notredame, Cedric
Tutor
Notredame, Cedric
Pages
206 p.
Citation
Jin Wu, S. Dealing with noise and biases in omics: statistical methods, deep learning and workflows. Universitat Pompeu Fabra; 2026. handle: http://hdl.handle.net/10803/696721
This citation was generated automatically.
Doctoral program
Universitat Pompeu Fabra. Doctorat en Biomedicina
Abstract
The rise of high-throughput technologies has transformed the life sciences, enabling a shift from studying single molecules to profiling entire genomes, transcriptomes, epigenomes, microbiomes, and beyond. While powerful, these technologies generate datasets that are inherently high-dimensional and noisy. Moreover, measurement techniques typically yield relative rather than absolute measurements, thereby introducing compositional bias: if the absolute quantity of one component increases, the relative proportions of the other components automatically shrink. Together, these characteristics complicate downstream analysis and interpretation. Traditional statistical frameworks, for example, often assume that variables are independent and measured on an absolute scale, an assumption that compositional data violate. Complex approaches such as deep learning, while powerful, are extremely sensitive to the underlying data characteristics: these models can easily overfit noisy patterns in the training data and fail to generalize to new datasets. The success of AlphaFold2 (AF2) illustrates the potential of deep learning in biology, but also highlights a key limitation: AF2 benefited from a well-defined problem and access to large, clean, standardized datasets, conditions rarely met in omics. Although numerous methods have been developed to address the inherent complexities of omics data, none has emerged as universally optimal. Further complicating matters, method choice is frequently shaped by practical factors such as ease of use, popularity, or accessibility, rather than by suitability to the data at hand, leading to suboptimal outcomes. My thesis addresses these challenges through multiple contributions, with a particular focus on transcriptomics. Specifically, I investigated how compositional bias affects gene-gene correlation analysis, and provided a way to compute regularized partial correlations that are valid for compositional data.
In addition, I reinterpreted differential proportionality, a normalization-free pairwise approach, as a proxy for differential expression analysis. In parallel, I contributed to the development of nf-core/differentialabundance, a reproducible, scalable pipeline for differential analysis built on the nf-core ecosystem. While it currently supports a limited set of methods, it is designed for extensibility and community-driven expansion. The goal is to make alternative approaches more accessible and to facilitate informed method selection and integration through automated comparative analysis. Lastly, recognizing that data, not just algorithms, are central to both traditional statistical analysis and deep learning, I co-developed stimulus-py and nf-core/deepmodeloptim, a joint framework that enables systematic exploration of how different aspects of the data, or of their processing, affect model behavior. By placing data at the center of model development, this approach aims to produce more robust, generalizable deep learning models in biology. The success of AF2 illustrates the value of domain-adapted deep learning, but the ultimate goal is to move beyond such isolated breakthroughs, towards a future where deep learning delivers reliable, interpretable, and biologically grounded solutions across a wide range of tasks. Stimulus-py and nf-core/deepmodeloptim represent concrete steps in that direction: tools to optimize model development by aligning it with the realities of biological data. Finally, the last part of this thesis shows how structures predicted by AF2 with experimental-level accuracy can be used to improve multiple sequence alignments.
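The compositional bias described in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the thesis: the gene names and abundance values are hypothetical, and the `close` helper is an illustrative name for the usual normalization to proportions. It shows that raising the absolute abundance of one gene lowers the relative proportions of the unchanged genes, while the log-ratio between two unchanged genes stays the same, which is the kind of property that ratio-based, normalization-free approaches rely on.

```python
import math


def close(counts):
    """Convert absolute abundances to relative proportions that sum to 1."""
    total = sum(counts.values())
    return {gene: value / total for gene, value in counts.items()}


# Hypothetical absolute abundances: only geneA changes between conditions.
control = {"geneA": 100.0, "geneB": 100.0, "geneC": 200.0}
treated = {"geneA": 1000.0, "geneB": 100.0, "geneC": 200.0}

rel_control = close(control)
rel_treated = close(treated)

# geneB and geneC appear to decrease in relative terms,
# although their absolute abundances are identical in both conditions.
for gene in control:
    print(f"{gene}: {rel_control[gene]:.3f} -> {rel_treated[gene]:.3f}")

# The log-ratio between the two unchanged genes is not affected,
# because the sample total cancels out in the ratio.
log_ratio_control = math.log(rel_control["geneB"] / rel_control["geneC"])
log_ratio_treated = math.log(rel_treated["geneB"] / rel_treated["geneC"])
print(f"log(geneB/geneC): {log_ratio_control:.3f} vs {log_ratio_treated:.3f}")
```

An analysis run directly on the proportions would report geneB and geneC as changed, whereas the comparison of their ratio correctly reports no change between them.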
Keywords
Transcriptomic analysis, Noise and biases, Normalization, Deep learning, Differential abundance, Gene-gene correlation, Compositional bias, Bioinformatics, Pipelines
Subjects
575 - General genetics. General cytogenetics. Immunogenetics. Evolution. Phylogeny
Publisher
Universitat Pompeu Fabra







