Department of Information and Communication Technologies

Open-access research documents, such as journal articles, books, conference papers, presentations, and posters at workshops and conferences, from the Department of Information and Communication Technologies at UPF.

Permanent URI for this community: http://hdl.handle.net/10230/5922

Recent Submissions

  • Open Access
    Exploring morphology-aware tokenization: a case study on Spanish language modeling
    (ACL (Association for Computational Linguistics), 2025) Táboas García, Alba; Przybyła, Piotr; Wanner, Leo
    This paper investigates to what extent the integration of morphological information can improve subword tokenization and thus also language modeling performance. We focus on Spanish, a language with fusional morphology, where subword segmentation can benefit from linguistic structure. Instead of relying on purely data-driven strategies like Byte Pair Encoding (BPE), we explore a linguistically grounded approach: training a tokenizer on morphologically segmented data. To do so, we develop a semi-supervised segmentation model for Spanish, building gold-standard datasets to guide and evaluate it. We then use this tokenizer to pre-train a masked language model and assess its performance on several downstream tasks. Our results show improvements over a baseline with a standard tokenizer, supporting our hypothesis that morphology-aware tokenization offers a viable and principled alternative for improving language modeling.
  • Open Access
    Countering disinformation by finding reliable sources: a citation-based approach
    (Institute of Electrical and Electronics Engineers (IEEE), 2022) Przybyła, Piotr; Borkowski, Piotr; Kaczyński, Konrad
    We propose a new task aimed at countering dis- and misinformation, called Finding Reliable Sources. Given a one-sentence claim, the challenge is to automatically find a knowledge source (e.g. a book, a research article, a web page) that could support or refute the claim. We show that this capability could be learnt by observing associations between sentences in English Wikipedia and citations provided for them. Thus, we collect a corpus of over 50 million references to 24 million identified sources with the citation context from Wikipedia, and build search indices using several meaning representation methods. For evaluation, apart from the Wikipedia corpus, we prepare another test set based on the FEVER fact-checking dataset.
  • Open Access
    Attacking misinformation detection using adversarial examples generated by language models
    (ACL (Association for Computational Linguistics), 2025) Przybyła, Piotr; McGill, Euan; Saggion, Horacio
    Large language models have many beneficial applications, but can they also be used to attack content-filtering algorithms in social media platforms? We investigate the challenge of generating adversarial examples to test the robustness of text classification algorithms detecting low-credibility content, including propaganda, false claims, rumours and hyperpartisan news. We focus on simulation of content moderation by setting realistic limits on the number of queries an attacker is allowed to attempt. Within our solution (TREPAT), initial rephrasings are generated by large language models with prompts inspired by meaning-preserving NLP tasks, such as text simplification and style transfer. Subsequently, these modifications are decomposed into small changes, applied through a beam search procedure, until the victim classifier changes its decision. We perform quantitative evaluation using various prompts, models and query limits, targeted manual assessment of the generated text and qualitative linguistic analysis. The results confirm the superiority of our approach in the constrained scenario, especially in the case of long input text (news articles), where exhaustive search is not feasible.
  • Open Access
    Capturing the style of fake news
    (AAAI Press, 2020) Przybyła, Piotr
    In this study we aim to explore automatic methods that can detect online documents of low credibility, especially fake news, based on the style they are written in. We show that general-purpose text classifiers, despite seemingly good performance when evaluated simplistically, in fact overfit to sources of documents in training data. In order to achieve a truly style-based prediction, we gather a corpus of 103,219 documents from 223 online sources labelled by media experts, devise realistic evaluation scenarios and design two new classifiers: a neural network and a model based on stylometric features. The evaluation shows that the proposed classifiers maintain high accuracy in the case of documents on previously unseen topics (e.g. new events) and from previously unseen sources (e.g. emerging news websites). An analysis of the stylometric model indicates it indeed focuses on sensational and affective vocabulary, known to be typical of fake news.
  • Open Access
    In vivo whole-cortex marker of excitation-inhibition ratio indexes cortical maturation and cognitive ability in youth
    (National Academy of Sciences, 2024) Zhang, Shaoshi; Deco, Gustavo; Yeo, B. T. Thomas
    A balanced excitation-inhibition ratio (E/I ratio) is critical for healthy brain function. Normative development of cortex-wide E/I ratio remains unknown. Here, we noninvasively estimate a putative marker of whole-cortex E/I ratio by fitting a large-scale biophysically plausible circuit model to resting-state functional MRI (fMRI) data. We first confirm that our model generates realistic brain dynamics in the Human Connectome Project. Next, we show that the estimated E/I ratio marker is sensitive to the gamma-aminobutyric acid (GABA) agonist benzodiazepine alprazolam during fMRI. Alprazolam-induced E/I changes are spatially consistent with positron emission tomography measurement of benzodiazepine receptor density. We then investigate the relationship between the E/I ratio marker and neurodevelopment. We find that the E/I ratio marker declines heterogeneously across the cerebral cortex during youth, with the greatest reduction occurring in sensorimotor systems relative to association systems. Importantly, among children with the same chronological age, a lower E/I ratio marker (especially in the association cortex) is linked to better cognitive performance. This result is replicated across North American (8.2 to 23.0 y old) and Asian (7.2 to 7.9 y old) cohorts, suggesting that a more mature E/I ratio indexes improved cognition during normative development. Overall, our findings open the door to studying how disrupted E/I trajectories may lead to cognitive dysfunction in psychopathology that emerges during youth.