Recent Submissions

  • Open Access
    Code-switching and identity in digital China: language ideologies in Mandarin–Cantonese WeChat communication
    (2025) Tu, Jingjing
    This study explores how digital code choice and code-switching between Mandarin and Cantonese shape identity construction, interactional strategies, and ideological positioning in WeChat group communication. Using a mixed-methods approach, the analysis draws on 370 chat messages (3,568 characters) and 74 questionnaire responses collected from two Cantonese-speaking groups. The results show that Mandarin is dominant, especially in formal or task-oriented exchanges, while Cantonese serves key social and emotional functions. Code-switching often signals shifts in tone, intimacy, or social stance. Users adapt code use according to group type, topic, and interpersonal relationships, balancing clarity with relational goals. The study also reveals ideological perceptions that associate Mandarin with rationality and standardization, while Cantonese expresses warmth and regional identity. These findings highlight how users navigate digital contexts by managing symbolic code values, contributing to broader understandings of multilingual interaction online.
  • Open Access
    L'expressió de la genericitat en llengua de signes catalana
    (2025) Singla i Milian, Eloi
    This work studies the expression of genericity in Catalan Sign Language (LSC). To do so, it uses three datasets: two come from elicitations with signers of two different generations, and the third results from applying felicity diagnostics to several types of noun phrases in sentences with a generic reading. Of the work's three objectives, the first is to test and elaborate on the hypotheses of previous research on the phenomenon in LSC; it is met by corroborating some of them, refining others, and detecting a new sign related to genericity. The second is to study possible microdiachronic change between the generation of young signers and that of older signers; evidence of it is found in both the expression and the perception of genericity. Finally, the last objective is to outline new lines of research, which is done on the basis of the findings of this investigation.
  • Open Access
    Psycholinguistic probing of language models' internal layers
    (2025) Rivera Hidalgo de Torralba, Paula
    This study investigates how transformer-based large language models (LLMs) resolve subject–verb agreement across syntactic structures of varying complexity and distractor conditions. We hypothesized that LLMs would succeed in simple configurations but struggle under deeper embedding and number mismatches. Using a controlled dataset adapted from psycholinguistic research, we analyze model behavior across six sentence structures, two attractor conditions (mismatch vs. no mismatch), and four lexical variants. Using the Pythia 6.9B model, we apply three evaluation metrics—accuracy, prediction depth, and the Tuned Lens interpretability method—to track how agreement resolution evolves across layers. Results confirm our hypothesis: the model performs reliably in simple structures but fails in deeply embedded object-relative clauses. Prediction depth shows early resolution in simple cases and delayed or failed resolution in complex ones. These findings clarify LLM limitations in syntactic processing and highlight the importance of using linguistically informed evaluation methods to better understand model behavior across structural configurations.
  • Open Access
    Cognate status and dialectal variation in bilingual spoken word recognition: the role of language dominance in Catalan–Spanish listeners
    (2025) Ridameya Jan, Pau
    This study investigates how cognate status, dialectal variation, and language dominance influence spoken word recognition in Catalan–Spanish bilinguals. Thirty participants completed a lexical decision task in which they listened to Catalan words presented in either Central Catalan (their native dialect) or Valencian (a non-native dialect), and categorized them as real words or not. The stimuli included both cognates and non-cognates with Spanish. Reaction times were analyzed using linear mixed-effects models, with language dominance measured continuously using the Bilingual Language Profile. Results showed that dialectal variation and language dominance modulated the effect of cognate status: while no main effect of cognates emerged in Central Catalan, a significant cognate disadvantage was found in Valencian, especially among Spanish-dominant participants. These findings suggest that cross-linguistic phonological similarity can hinder rather than facilitate recognition when combined with dialectal unfamiliarity, and highlight the importance of individual linguistic profiles in bilingual lexical access. The study contributes to a more nuanced understanding of how phonological variation across dialects interacts with bilingual processing mechanisms.
  • Open Access
    Piular or tuitejar? A study of social media loanwords in Catalan
    (2025) Marsó Riera, Laia
    Globalisation has allowed increased contact between societies and cultures. The evolution of societies and the creation of new technologies make coining new terms an inevitable task for language users. This study aims to investigate the processes of integrating foreign words, especially verbs, into the Catalan language in a social media setting, with an informal register. We create a list of words to study and collect data from Twitter/X. We disambiguate the contexts in which the words relate to social media by prompting large language models, with 0.9 accuracy. Our findings show that using native words is more common than adapting foreign roots. This behaviour is not shared with other findings in the context of newspapers and magazine texts, and we hypothesise that the register may be the reason. We also find both inflection and light verb constructions used to create new verbal forms.
  • Open Access
    Verbal borrowings in Russian social media: a diachronic corpus study
    (2025) Lysova, Iuliia
    This study examines the adaptation of English-derived verbal concepts related to social media in Russian, focusing on their orthographic, morphological, and semantic integration within online discourse. Based on a diachronic corpus of over 41,000 tweets from 2013 to 2024, the analysis employs computational tools to explore 21 English source concepts, such as ‘to post’, ‘to like’, and ‘to follow’, and their Russian equivalents. The findings reveal that while borrowed forms exhibit greater lemma diversity, their usage frequencies are comparable to those of translated variants. Morphological adaptation shows a growing preference for the suffix -ну- in the formation of perfective verbs. The most fully integrated borrowings display a wide range of meaningful affixal variants. Furthermore, the choice and integration of verbal concepts are influenced by a combination of semantic, social, and platform-specific factors, including users’ prior experience with other social media interfaces. These findings underscore social media as a dynamic environment for lexical innovation in Russian.
  • Open Access
    Using I vs. you to refer to yourself in self-talk: exploring grammatical contextual variations in Mandarin Chinese
    (2025) Liu, Tongyu
    This study investigates how Mandarin Chinese speakers choose between first-person (“我”, wǒ) and second-person (“你”, nǐ) pronouns in self-talk across varying emotional and temporal contexts. Building on insights from both psychology and linguistics, the research explores how emotional valence (positive, neutral, negative), temporal orientation (retrospective, present-related, prospective), and the presence of a mirror (as a visual self-reflection cue) influence pronoun selection. It further examines whether pronoun choice in self-talk reflects categorical grammatical constraints or psychological tendencies. Fifty native Mandarin speakers completed a choice task involving 54 illustrated scenarios. Each scenario included a pair of self-talk utterances that were semantically equivalent but differed in pronoun use. Results revealed that first-person forms were the dominant choice overall. However, second-person usage increased significantly in negative scenarios. In addition, the presence of a mirror led to a substantial rise in second-person choices across all contexts. Statistical analyses confirmed significant main effects of emotional valence, temporal orientation, and mirror presence, as well as an interaction between valence and temporal framing. The study deepens our understanding of how contextual variables shape the choice of personal pronouns in self-talk. It also extends previous findings from Indo-European languages to Mandarin, thereby enriching cross-linguistic perspectives on self-talk. Keywords: self-talk, Mandarin Chinese, emotional valence, temporal perspective, mirror effect, first-person pronoun, second-person pronoun
  • Open Access
    De Barcelona a Valladolid: comparación de actitudes lingüísticas en Cataluña y Valladolid frente a rasgos del castellano de Cataluña
    (2025) González García, Roberto
    This work analyzes the acceptability of variants traditionally associated with the Spanish spoken in Catalonia, comparing the perceptions of speakers from Catalonia and from Valladolid, as well as the registers with which these variants are associated. It seeks to determine whether some variants are losing their regional character and spreading in use, and to examine how language attitudes differ between Catalans and speakers from other areas, considering possible social, cultural, and ideological factors. To this end, questionnaires were used to collect data on acceptability and register association. The results indicate that there are variables that influence speakers' acceptance, that Catalans show more prescriptivist attitudes toward their own variety (possibly because of the community's sociolinguistic situation), and that some variants are extending their use beyond the variety of Spanish spoken in Catalonia.
  • Open Access
    Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian
    (2025) Fomicheva, Marina
    Retrieval-Augmented Generation (RAG) systems depend on dense embeddings to retrieve relevant context for open-domain question answering. A critical requirement for these embeddings is semantic consistency – the ability to remain stable across meaning-preserving variation. This study examines how modern encoder-only models handle active/passive voice alternations in English and Russian. Using a bilingual dataset of 500 factual question pairs, we evaluate semantic consistency (Overlap@K) and retrieval quality (MRR, Recall@K) in raw and fine-tuned versions of EuroBERT and RuModernBERT. Findings show that representations of raw encoders are only partially semantic: they are sensitive to word order, morphology, and query length. Consistency was significantly higher in English, indicating that morphologically rich languages like Russian are more challenging. EuroBERT performed poorly on Russian due to limited training exposure and subword fragmentation. RuModernBERT performed better on Russian passives, likely reflecting its training data. Contrastive fine-tuning substantially improved performance, though not all fine-tuned models benefited equally – EuroBERT_FT and LaBSE showed limitations tied to tokenization and training objectives.
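The retrieval metrics named in this abstract can be illustrated with a minimal sketch. The abstract does not define them, so the definitions below are assumptions: Recall@K and MRR follow their common textbook forms, and Overlap@K is taken to be the share of document IDs common to the top-K results of two paraphrased queries (for example, an active and a passive version). All function names and document IDs are hypothetical and are not taken from the thesis.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def overlap_at_k(ranked_a, ranked_b, k):
    """Assumed definition: shared IDs in the two top-k lists, divided by k."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k


# Hypothetical rankings for an active-voice and a passive-voice query.
active = ["d1", "d2", "d3", "d4"]
passive = ["d2", "d5", "d1", "d6"]
relevant = {"d1"}

consistency = overlap_at_k(active, passive, 3)
mrr = mean_reciprocal_rank([(active, relevant), (passive, relevant)])
```

Under these assumed definitions, a perfectly voice-invariant encoder would give an Overlap@K of 1.0 for every question pair, regardless of whether the retrieved documents are relevant; retrieval quality is measured separately by Recall@K and MRR.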
  • Open Access
    Terminología y lenguaje claro: el consentimiento informado para pacientes geriátricas
    (2025) Fernández-Lopez, Ismael
    This work analyzes the accessibility of informed consent (IC) forms addressed to older women in the Spanish healthcare context. Using a corpus of nine IC forms related to surgical procedures that are common in this demographic group (genital prolapse, cataracts, and hip replacement), it analyzes the main barriers to comprehension, focusing on design and on the discourse, morphosyntactic, and lexical levels. By means of automatic tools such as arText and surveys of patients and healthcare staff, it confirms the complexity posed by learned combining forms and the limited effectiveness of poorly managed terminological variation. An enriched IC model based on plain-language criteria is proposed and evaluated, and its effectiveness is validated empirically. The results show that an accessible design and more transparent terminology can significantly improve comprehension of the medical document and thereby strengthen the patient's right to information and autonomy.
  • Open Access
    The effect of practice frequency on pronunciation gains in a self-paced karaoke training with high school ESL students
    (2025) Dolz Lane, Madeleine Louisa
    Research shows that music can assist in second language pronunciation instruction, and that the use of familiar melodies can further bootstrap phonological acquisition. This study tested whether a self-paced karaoke training can improve English pronunciation among 52 Spanish-Catalan bilingual high schoolers aged 14-16. After two in-class tutorial sessions, participants voluntarily completed between zero and five self-paced karaoke training sessions of approximately ten minutes each over one week, using English songs with familiar melodies (N of students with 0-2 training sessions = 22; N of students with 3-5 training sessions = 30). Pronunciation was assessed before and after training for comprehensibility, fluency, and accentedness, using sentence imitation and reading tasks with both trained and untrained materials. Results showed that, while both groups improved, students who completed three to five sessions improved significantly more than the zero-to-two-sessions group across all measures and tasks with trained phrases (except comprehensibility in reading), and in comprehensibility for untrained phrases in the imitation task. A satisfaction questionnaire at post-test showed that 53.9% of participants felt the training had improved their pronunciation, and 79.5% were satisfied with the learning experience. The results suggest that a short karaoke training using familiar songs can effectively enhance second language pronunciation in adolescent learners.
  • Open Access
    Exploring cross-linguistic contextual predictors of colexification
    (2024) Gómez Zuluaga, Ángela María
    Colexification, a linguistic phenomenon where one word expresses multiple meanings, permeates natural language. Despite its ubiquity, its underlying mechanisms are still largely unknown. This study explores the effect of different proxies of context confusability, specifically cosine similarity and symmetrized Kullback-Leibler divergence, through a cross-linguistic analysis using contextualized embeddings for English and Spanish data. By analyzing the distribution of raw data and fitting various logistic regression models, we examined the behavior of these measures in the context of colexification and tested their predictive capacity. Our findings are consistent across both English and Spanish, with slight variations between them. Both predictors were found to hold information for the prediction of colexification, effectively working as proxies of contextual confusability through all sets of contextualized embeddings. This study contributes to a deeper understanding of the circumstances under which colexification occurs, offering new insights into the factors influencing this linguistic phenomenon.
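The two confusability proxies named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the measures were computed, the symmetrized Kullback-Leibler divergence is taken here as the sum of the two directed divergences (some authors use the average instead), and both input distributions are assumed to be strictly positive. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p, q):
    """Directed KL divergence D(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetrized_kl(p, q):
    """Assumed definition: D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Toy context distributions for two meanings (hypothetical values).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

similarity = cosine_similarity(p, q)
divergence = symmetrized_kl(p, q)
```

Higher cosine similarity and lower symmetrized divergence both indicate that two meanings occur in more similar contexts; how the thesis maps these values onto context confusability is not specified in the abstract.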
  • Open Access
    Scrutinizing the predictive power of large language models for brain function
    (2024) Yang, Ni
    Based on predictive coding and hierarchical processing as a commonality between large language models (LLMs) and the brain, many studies have linked the two by regressing brain activity on LLMs’ representations. However, increasing evidence has revealed problems in this new line of research. To address this issue, we attempted to replicate a pioneering study (Kumar et al., 2022) on an independent fMRI dataset with several methodological adaptations. Results showed overall low correlation scores and sparse predictions across the cortex. Contrary to the reference study, the performance of the representations did not differ significantly across most ROIs. However, in areas where significant differences were observed, fastText consistently outperformed BERT. Additionally, the layer-wise performance of embeddings and transformations showed no consistent patterns. Our findings challenge the existing assumptions regarding the predictive power of LLMs for brain function and highlight potential issues in the current methodologies for mapping predictions from LLMs onto brain activations.
  • Open Access
    Exploring geometric compression across languages in multilingual language models
    (2024) Ruiz Moreno, Eder
    This study explores geometric compression of linguistic data across languages in multilingual language models using the Europarl corpus, focusing on three models: BLOOM, XLM-RoBERTa, and Mistral. We estimate the intrinsic dimension (ID) of hidden representations at each layer to quantify geometric compression. In Transformer-based LMs, the last hidden representation arises from a series of intermediate representations computed through a number of identical modules. Our analysis reveals that the ID of these representations is significantly smaller than the ambient dimension, with distinct compression patterns across languages. Languages from the same family exhibit similar ID amplitudes, suggesting that shared linguistic properties impact the dimensionality of model representations. Additionally, we find that the model’s performance on a language correlates with the ID amplitude at the first high-dimensionality phase, indicating that the learned linguistic properties influence compression. These findings complement those found in other studies, bringing new insights to our understanding of how state-of-the-art LLMs process and compress linguistic data in different languages.
  • Open Access
    Contextual variation in predicting adjective ordering: an information-theoretical measure for meaning specificity
    (2024) Ju, Sijie
    When multiple adjectives are used to modify an object, their ordering generally follows a certain fixed pattern. This rule of order exists across languages and has attracted many linguists to seek a possible general explanation for it. Meaning specificity has been shown to be a reliable predictor of English adjective ordering. In this study, we propose contextual variation as a new measure to quantify the meaning specificity of adjectives. We applied this measure to both Mandarin and English adjectives and evaluated its effectiveness in predicting their order. Our model proves to be an effective predictor in Mandarin but not in English. Upon analyzing the contexts in which adjectives are used, we discovered that polysemy is the primary factor contributing to the outliers in measuring specificity.
  • Open Access
    The use of referential iconic character-viewpoint gestures as an indicator of child-directed narrative speech
    (2024) Li, Jiali
    According to the audience design model, speakers adjust their communication strategies according to the cognitive level and needs of the audience to achieve more effective communication results. Among the various factors that affect audience design, the age of the audience is particularly important. This study investigates how co-speech gestures are used in child-directed speech (CDS) and adult-directed speech (ADS) storytelling tasks. Following a within-subjects design, 40 young adult participants (35 females and 5 males) from the TEACH-TALK corpus were asked to tell the same story to both simulated adult and child audiences. All gestures (N = 3255) were coded according to referentiality (referentials vs. non-referentials). Referential gestures were further classified according to specific dimensions of referentiality (iconicity, deixis, and metaphoricity). Specifically, referential iconic gestures were classified according to viewpoint (character-viewpoint or CVPT, observer-viewpoint or OVPT, dual-viewpoint or DVPT, and narrator-viewpoint or NVPT) and informativeness (redundant vs. non-redundant). The results showed no significant differences in the overall gesture rate, gesture referentiality, and gesture informativeness between ADS and CDS conditions. However, speakers used a higher rate of referential iconic gestures in CDS compared to ADS, and crucially, the rate of CVPT iconic gestures was significantly higher in CDS than in ADS. Our results suggest that speakers adjust their gesture use according to the needs of their audience; specifically, they are more likely to use referential iconic gestures, particularly those from a CVPT perspective, in narrative speech with children than with adults.
  • Open Access
    Multimodal communication in child-directed and adult-directed narrative speech: Speakers’ body movements and prosodic variance
    (2024) Defty, Lars
    It has been demonstrated that adults adapt their speech when speaking to infants and children (child-directed speech or CDS) in comparison with adult-directed speech (ADS), exhibiting increased vocal pitch and pitch variation, and exaggerating use of co-speech gestures such as head movements and manual gestures. It has also been observed that the vocal aspects (i.e., speech itself) and the visual aspects (gesture) of human communication work in nuanced coordination with one another. However, whether augmentations of vocal cues in CDS are directly related to augmentations of gesture is unclear. The current study examined the relationship between augmentations of vocal and visual cues in CDS using TEACH-TALK, an audiovisual corpus of 40 future teachers who told the same story in two simulated conditions (ADS and CDS). Results revealed that vocal pitch, vocal pitch variance, and rates of body movement were clearly augmented in the CDS condition. Results showed no statistically meaningful correlations between vocal pitch variance and rates of body movement in ADS or in CDS, and no statistically significant relationship between augmentations in vocal pitch variance and body movement from ADS to CDS. These findings highlight the complex nature of speech-gesture coordination and the need for further study in order to better understand speech-gesture relationships in different communicative registers.
  • Open Access
    Using context variation indexes for the detection of semantic neologisms
    (2024) Moset Estruch, Laura
    Detecting semantic neologisms (SNs) is a complex problem. Unlike formal neologisms, SNs do not instil a feeling of novelty, because the word forms already exist in the language and change only at the semantic level. This makes the identification process especially difficult to automate. The literature on the topic argues that a term’s context is the best tool for detection: as the term changes meaning, so do the words it collocates with. This study presents a new method of automatic detection that uses context variation indexes calculated from diachronic corpora. This approach is much simpler than other state-of-the-art procedures because it depends less on other computational tools and requires less software maintenance. The method is tested on 100 Spanish and Catalan terms. All tests demonstrated consistent performance. The results highlight term frequency and the usage frequency of the neological sense as key factors in the method's success.
  • Open Access
    Building a dataset of emotions with distant supervision
    (2024) Schaefer Trindade, Luísa
    In Natural Language Processing (NLP), emotion detection is a challenging problem of text classification. Using supervised machine learning to tackle this task requires annotated datasets, which can be difficult to come by because they are costly to produce. Moreover, emotions are subjective, and human annotators often disagree in their assessments. Recently, many methods have been proposed to reduce costs, including distant supervision. This thesis presents a strategy for annotating emotions in literary works in Brazilian Portuguese. Using a combination of regular expressions for automatic dialogue extraction, spaCy, and a lexicon containing 26 emotions, we classify dialogue by considering words used by the narrator to introduce and describe it. The results are mixed, given the large set of emotion labels, many of which are underrepresented in our data collection efforts. However, this strategy can still benefit the annotation of literary corpora with more common emotions such as Happiness and Dissatisfaction.
  • Open Access
    A descriptive study of the Brazilian neologisms sextou, trintou, and other morphosemantically similar words
    (2024) Salgueiro Zorman de Menezes, Anna Lívia
    This thesis presents a description of the main linguistic features of some neologisms used in informal contexts in Brazilian Portuguese: sextou (derived from sexta-feira, ‘Friday’), trintou (derived from trinta, ‘thirty’) and other morphosemantically similar words. They are identical in form to a past tense verb in 2nd and 3rd person singular but are not pragmatically used in agreement with this tense, person or number, as a corpus study carried out for the thesis shows. Their morphological creation process is argued to be derivation, and they convey not only the meaning of their base noun, but also the emotional feeling speakers get from it (e.g., the joy of being on a Friday and having the weekend ahead to rest). This meaning is accounted for by appealing to context and common ground between speakers and hearers. The proposed account is contrasted with the one existing account in the literature.