Exploring geometric compression across languages in multilingual language models

  • dc.contributor.author Ruiz Moreno, Eder
  • dc.date.accessioned 2025-03-07T16:01:12Z
  • dc.date.available 2025-03-07T16:01:12Z
  • dc.date.issued 2024
  • dc.description Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Corentin Kervadec
  • dc.description.abstract This study explores geometric compression of linguistic data across languages in multilingual language models using the Europarl corpus, focusing on three models: BLOOM, XLM-RoBERTa, and Mistral. We estimate the intrinsic dimension (ID) of hidden representations at each layer to quantify geometric compression. In Transformer-based LMs, the last hidden representation arises from a series of intermediate representations computed through a number of identical modules. Our analysis reveals that the ID of these representations is significantly smaller than the ambient dimension, with distinct compression patterns across languages. Languages from the same family exhibit similar ID amplitudes, suggesting that shared linguistic properties impact the dimensionality of model representations. Additionally, we find that the model's performance on a language correlates with the ID amplitude at the first high-dimensionality phase, indicating that the learned linguistic properties influence compression. These findings complement those of other studies, bringing new insights to our understanding of how state-of-the-art LLMs process and compress linguistic data in different languages.
  • dc.identifier.uri http://hdl.handle.net/10230/69855
  • dc.language.iso eng
  • dc.rights Creative Commons license, Attribution-NonCommercial-NoDerivatives 4.0 International
  • dc.rights.accessRights info:eu-repo/semantics/openAccess
  • dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/deed.ca
  • dc.subject.keyword LLMs
  • dc.subject.keyword Transformers
  • dc.subject.keyword Multilingual LMs
  • dc.subject.keyword Geometric compression
  • dc.subject.keyword Intrinsic dimension
  • dc.title Exploring geometric compression across languages in multilingual language models
  • dc.type info:eu-repo/semantics/masterThesis
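
The abstract describes estimating the intrinsic dimension (ID) of hidden representations at each layer, but this record does not say which estimator the thesis uses. As a rough illustration only, the sketch below assumes a TwoNN-style maximum-likelihood estimator, which relies on the ratio between each point's second and first nearest-neighbour distances. The function name `twonn_id` and the synthetic data are hypothetical; in practice the input would be a matrix of per-sentence hidden states taken from one layer of a model.

```python
import numpy as np


def twonn_id(points):
    """Estimate intrinsic dimension with a TwoNN-style maximum-likelihood formula.

    Assumption: this is one common ID estimator, not necessarily the one
    used in the thesis. `points` has shape (n_samples, ambient_dim).
    """
    x = np.asarray(points, dtype=np.float64)
    x = x - x.mean(axis=0)  # centring reduces cancellation error below

    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x * x, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)   # clip tiny negative values from rounding
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

    # Two smallest distances per row: first and second nearest neighbours.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])

    # Drop exact duplicates, where the ratio r2 / r1 is undefined.
    keep = r1 > 0
    mu = r2[keep] / r1[keep]

    # Maximum-likelihood estimate under a Pareto law for mu: d = N / sum(log mu).
    return keep.sum() / np.sum(np.log(mu))


if __name__ == "__main__":
    # Synthetic check: a 2-D plane embedded linearly in a 50-D ambient space.
    rng = np.random.default_rng(0)
    latent = rng.uniform(size=(1000, 2))
    embedded = latent @ rng.normal(size=(2, 50))
    print("ambient dimension:", embedded.shape[1])
    print("estimated ID:", twonn_id(embedded))
```

Applied layer by layer, an estimate of this kind would give one ID value per layer, which can then be compared against the ambient (hidden-state) dimension and across languages.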