Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian
- dc.contributor.author Fomicheva, Marina
- dc.date.accessioned 2025-07-30T15:21:19Z
- dc.date.available 2025-07-30T15:21:19Z
- dc.date.issued 2025
- dc.description Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Núria Bel
- dc.description.abstract Retrieval-Augmented Generation (RAG) systems depend on dense embeddings to retrieve relevant context for open-domain question answering. A critical requirement for these embeddings is semantic consistency – the ability to remain stable across meaning-preserving variation. This study examines how modern encoder-only models handle active/passive voice alternations in English and Russian. Using a bilingual dataset of 500 factual question pairs, we evaluate semantic consistency (Overlap@K) and retrieval quality (MRR, Recall@K) in raw and fine-tuned versions of EuroBERT and RuModernBERT. Findings show that representations of raw encoders are only partially semantic: they are sensitive to word order, morphology, and query length. Consistency was significantly higher in English, indicating that morphologically rich languages like Russian are more challenging. EuroBERT performed poorly on Russian due to limited training exposure and subword fragmentation. RuModernBERT performed better on Russian passives, likely reflecting its training data. Contrastive fine-tuning substantially improved performance, though not all fine-tuned models benefited equally – EuroBERT_FT and LaBSE showed limitations tied to tokenization and training objectives.
- dc.identifier.uri http://hdl.handle.net/10230/71040
- dc.language.iso eng
- dc.rights Creative Commons Attribution 4.0 International license (CC BY 4.0)
- dc.rights.accessRights info:eu-repo/semantics/openAccess
- dc.rights.uri https://creativecommons.org/licenses/by/4.0/
- dc.subject.other Russian language -- Passive voice
- dc.title Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian
- dc.type info:eu-repo/semantics/masterThesis
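
The abstract names three metrics (Overlap@K, MRR, Recall@K) without defining them in this record. The sketch below shows one common way such metrics are computed; it is not taken from the thesis. In particular, it assumes that Overlap@K is the share of documents common to the top-K results of the active and the passive query, and all function names and document IDs are hypothetical.

```python
from typing import Hashable, Sequence, Set


def overlap_at_k(ranked_a: Sequence[Hashable], ranked_b: Sequence[Hashable], k: int) -> float:
    """Assumed definition: fraction of the top-k results shared by two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
) -> float:
    """Mean of 1/rank of the first relevant document (0 if none is retrieved)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical retrieval results for an active query and its passive paraphrase.
active_hits = ["d1", "d2", "d3", "d4"]
passive_hits = ["d2", "d5", "d1", "d6"]
gold = {"d1"}  # hypothetical relevant document for both phrasings

consistency = overlap_at_k(active_hits, passive_hits, k=3)
recall_passive = recall_at_k(passive_hits, gold, k=2)
mrr = mean_reciprocal_rank([active_hits, passive_hits], [gold, gold])
```

Under this reading, a consistent encoder would return nearly the same top-K documents for both phrasings (Overlap@K close to 1), independently of whether those documents are the relevant ones, which is what MRR and Recall@K measure.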