Authors: Thompson, Mike; Martín, Mariano; Sanmartín Olmo, Trinidad; Rajesh, Chandana; Koo, Peter K.; Bolognesi, Benedetta; Lehner, Ben, 1978-
Dates: 2025-09-08; 2025-09-08; 2025
Citation: Thompson M, Martín M, Olmo TS, Rajesh C, Koo PK, Bolognesi B, et al. Massive experimental quantification allows interpretable deep learning of protein aggregation. Sci Adv. 2025 May 2;11(18):eadt5111. DOI: 10.1126/sciadv.adt5111
ISSN: 2375-2548
URI: http://hdl.handle.net/10230/71148

Abstract: Protein aggregation is a pathological hallmark of more than 50 human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the aggregation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts aggregation from sequence. We adapt genomic neural network interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive experimental analysis of random sequence-spaces and provide an interpretable and robust neural network model to predict aggregation.

Format: application/pdf
Language: eng
Rights: This is an open-access article distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Subject: Proteïnes--Agregació (Proteins--Aggregation)
Title: Massive experimental quantification allows interpretable deep learning of protein aggregation
Type: info:eu-repo/semantics/article
Related DOI: http://dx.doi.org/10.1126/sciadv.adt5111
Access: info:eu-repo/semantics/openAccess