Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian

Description

Abstract
Retrieval-Augmented Generation (RAG) systems depend on dense embeddings to retrieve relevant context for open-domain question answering. A critical requirement for these embeddings is semantic consistency – the ability to remain stable across meaning-preserving variation. This study examines how modern encoder-only models handle active/passive voice alternations in English and Russian. Using a bilingual dataset of 500 factual question pairs, we evaluate semantic consistency (Overlap@K) and retrieval quality (MRR, Recall@K) in raw and fine-tuned versions of EuroBERT and RuModernBERT. Findings show that representations of raw encoders are only partially semantic: they are sensitive to word order, morphology, and query length. Consistency was significantly higher in English, indicating that morphologically rich languages like Russian are more challenging. EuroBERT performed poorly on Russian due to limited training exposure and subword fragmentation. RuModernBERT performed better on Russian passives, likely reflecting its training data. Contrastive fine-tuning substantially improved performance, though not all fine-tuned models benefited equally – EuroBERT_FT and LaBSE showed limitations tied to tokenization and training objectives.
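The abstract names three metrics without defining them. The following is a minimal sketch of how they are commonly computed, not the thesis's own implementation. It assumes that Overlap@K is the share of documents common to the top-K results of the active and the passive version of a question, and that each query comes with a set of relevant document IDs. All function names and document IDs are hypothetical.

```python
from typing import Hashable, Sequence, Set


def overlap_at_k(active: Sequence[Hashable], passive: Sequence[Hashable], k: int) -> float:
    """Assumed definition: fraction of the top-k documents shared by both rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(active[:k]) & set(passive[:k])) / k


def recall_at_k(ranking: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Set[Hashable]]
) -> float:
    """Average reciprocal rank over all queries."""
    if len(rankings) != len(relevants):
        raise ValueError("one relevant set is required per ranking")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants))
    return total / len(rankings)


# Hypothetical rankings retrieved for the active and passive form of one question.
active_hits = ["d1", "d2", "d3"]
passive_hits = ["d2", "d4", "d1"]
consistency = overlap_at_k(active_hits, passive_hits, k=3)
```

Under this reading, a consistency score of 1.0 means that both voice variants retrieve the same top-K documents regardless of order, while the retrieval-quality metrics are computed separately for each variant against the gold documents.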
Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Núria Bel
Collections
Master's in Theoretical and Applied Linguistics. Research master's theses
