UPF Digital Repository
Recent Submissions
This thesis investigates open-domain zero-shot audio tagging on the BSD10k dataset, a curated heterogeneous subset of Freesound, using Contrastive Language–Audio Pretraining (CLAP) audio embeddings. To reduce the impact of rare and noisy labels, we apply a document frequency (DF) weighting scheme, which leads to substantial performance gains. We further introduce a semantic evaluation approach based on SBERT text embeddings, which captures semantically valid tags missed by exact string matching. This yields notable gains across systems, with the largest improvements in the baseline model and consistent improvements for both the DF-weighted variant and Freesound’s supervised tag recommender used for comparison. Together, the tag weighting and semantic evaluation demonstrate performance improvements beyond standard metrics. While the results show clear advances, zero-shot tagging with CLAP remains limited by incomplete generalization to folksonomy labels and sparse annotation coverage. Nevertheless, this work highlights the potential of zero-shot approaches to enable consistent and standardized audio annotation directly from raw audio.
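The DF weighting described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `log1p` form, the tag names, and the multiplicative combination with raw CLAP scores are all assumptions made for the example.

```python
# Hypothetical sketch of document-frequency (DF) weighting for zero-shot
# tag scores: rare (often noisy) folksonomy tags are down-weighted.
import math

def df_weight(doc_freq: int) -> float:
    # Low document frequency -> small weight; the log tempers common tags.
    return math.log1p(doc_freq)

def weighted_scores(raw_scores: dict, doc_freqs: dict) -> dict:
    # Multiplicative combination with the raw similarity score (assumed form).
    return {tag: s * df_weight(doc_freqs.get(tag, 0))
            for tag, s in raw_scores.items()}

scores = {"dog-bark": 0.90, "rare-glitch-tag": 0.95}   # illustrative tags
dfs = {"dog-bark": 500, "rare-glitch-tag": 2}
ranked = sorted(weighted_scores(scores, dfs).items(),
                key=lambda kv: kv[1], reverse=True)
```

Here the rare tag loses its slim raw-score lead once frequency is taken into account, which is the intended effect of the weighting scheme.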
(2025) Yapici, Tolga

We introduce SurpriseLSTM, an LSTM network that predicts note-by-note surprisal in strictly monophonic, symbolic melodies. We conduct the first systematic comparisons of a neural expectancy model against the symbolic IDyOM model and the audio-based AudioIC algorithm across five complementary validation paradigms: large-scale correlations on Western melody corpora, two-note pleasantness ratings, Bach-chorale surprise profiles, and the musical Wundt effect. SurpriseLSTM matches or exceeds the performance of IDyOM in aligning with human surprise judgments. A detailed context analysis reveals distinct differences in how neural and statistical models respond to authentic cadences versus non-cadential situations, with SurpriseLSTM showing superior containment rates and distribution fitting in predicting human scale degree expectations. These findings demonstrate that recurrent neural networks, trained on basic note-level features, can accurately capture the cognitive principles of statistical learning and probabilistic prediction in music perception. Code and a pre-trained model are available at https://github.com/lissenko/surprise-lstm
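The quantity SurpriseLSTM predicts, note-level surprisal, is simply the negative log-probability of a note given its context. A toy sketch, using a bigram count model as a stand-in assumption for the LSTM's predictive distribution:

```python
# Toy illustration of note-level surprisal: -log2 p(note | context).
# The bigram "model" is an illustrative stand-in for an LSTM's output
# distribution, not the SurpriseLSTM architecture itself.
import math
from collections import Counter, defaultdict

def bigram_probs(melody):
    counts = defaultdict(Counter)
    for prev, nxt in zip(melody, melody[1:]):
        counts[prev][nxt] += 1
    return {ctx: {n: c / sum(cnt.values()) for n, c in cnt.items()}
            for ctx, cnt in counts.items()}

def surprisal(p: float) -> float:
    # High-probability (expected) notes yield low surprisal, and vice versa.
    return -math.log2(p)

melody = [60, 62, 64, 62, 60, 62, 64, 65]      # MIDI pitches
probs = bigram_probs(melody)
# After 62, pitch 64 occurs twice and 60 once, so 60 is the more surprising
# continuation under this toy model.
```

A real expectancy model replaces the bigram table with a learned conditional distribution over the next note, but the surprisal readout is identical.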
(2025) Lissenko, Tanguy

This thesis investigates the internal mechanisms of audio-visual source separation models, focusing on how performer motion guides source separation in Carnatic music. To move beyond "black box" performance metrics, we employ a dual approach combining model-independent analysis of audio-visual synchrony with model-specific interpretability techniques applied to Voice-Vision Transformer (VoViT) architectures. A cross-correlation and sliding-window analysis on the Saraga dataset first establishes instrument-specific temporal patterns, revealing that targeted instrument-specific motion features exhibit stronger and more consistent correlations with audio dynamics than general body motion. While global linear audio-visual synchrony is weak, these analyses highlight the importance of localised and instrument-specific motion cues. Subsequently, we apply gradient-based saliency methods to a vocal separation model, demonstrating its primary reliance on facial keypoints. Ablation studies causally confirm that these facial regions are crucial for vocal separation, while body motion contributes minimally. We further analyze the attention mechanisms of a vocal & violin model to understand how it disentangles spectrally overlapping sources, specifically through a revised hybrid fusion architecture. FiLM (Feature-wise Linear Modulation) analysis reveals that visual information causally modulates audio features by consistently amplifying or suppressing them, acting as a dynamic gating mechanism. This research provides a framework for interpreting gesture-based audio-visual systems, offers novel insights into audio-visual learning, and contributes to the development of more transparent and culturally-aware Music Information Retrieval technologies. Code available at: https://github.com/theofuhrmann/masters-thesis
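The FiLM mechanism analyzed above is a per-channel affine transform of one modality's features conditioned on another. A minimal sketch, with shapes and values chosen purely for illustration (in VoViT-style models the scale and shift would be predicted by a network from visual features):

```python
# Minimal sketch of FiLM (Feature-wise Linear Modulation): per-channel
# scale (gamma) and shift (beta), here standing in for values a network
# would predict from visual (e.g. facial-keypoint) features.
def film(audio_feats, gamma, beta):
    # gamma > 1 amplifies a channel, gamma < 1 suppresses it: a gating effect.
    return [g * a + b for a, g, b in zip(audio_feats, gamma, beta)]

audio = [1.0, 1.0, 1.0]          # illustrative audio feature channels
gamma = [2.0, 0.5, 1.0]          # amplify ch0, suppress ch1, pass ch2
beta  = [0.0, 0.0, 0.1]
out = film(audio, gamma, beta)
```

The thesis's finding that visual input "consistently amplifies or suppresses" audio features corresponds to the learned gamma values staying above or below 1 for particular channels.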
(2025) Fuhrmann, Théo

Beat and downbeat tracking, jointly referred to as Meter Tracking, is a fundamental task in Music Information Retrieval (MIR). Deep learning models have far surpassed traditional signal processing and classical machine learning approaches in this domain, particularly for Western (Eurogenetic) genres, where large annotated datasets are widely available. These systems, however, perform less reliably on underrepresented musical traditions. Carnatic music, a rich tradition from the Indian subcontinent, is renowned for its rhythmic intricacy and unique metrical structures (tālas). The most notable prior work on meter tracking in this context employed probabilistic Dynamic Bayesian Networks (DBNs). The performance of state-of-the-art (SOTA) deep learning models on Carnatic music, however, remains largely unexplored. In this study, we evaluate two models for meter tracking in Carnatic music: the Temporal Convolutional Network (TCN), a lightweight architecture that has been successfully adapted for Latin rhythms, and Beat This!, a transformer-based model designed for broad stylistic coverage without the need for post-processing. Replicating the experimental setup of the DBN baseline on the Carnatic Music Rhythm (CMRf) dataset, we systematically assess the performance of these models in a directly comparable setting. We further investigate adaptation strategies, including fine-tuning the models on Carnatic data and the use of musically informed parameters. Results show that while off-the-shelf models do not always outperform the DBN, their performance improves substantially with transfer learning, matching or surpassing the baseline. These findings indicate that SOTA deep learning models can be effectively adapted to underrepresented traditions, paving the way for more inclusive and broadly applicable meter tracking systems.
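Comparisons like the one above are typically scored with a beat-tracking F-measure under a tolerance window (commonly ±70 ms). The simplified greedy matcher below illustrates the idea; it is a sketch, not the evaluation code of the study, and libraries such as mir_eval implement the full protocol:

```python
# Sketch of a beat-tracking F-measure with a +/-70 ms tolerance window,
# the style of metric used to compare trackers against annotated beats.
# Greedy one-to-one matching; a simplification of the standard procedure.
def f_measure(est, ref, tol=0.07):
    if not est or not ref:
        return 0.0
    matched, used = 0, set()
    for e in est:                       # est/ref: beat times in seconds
        for i, r in enumerate(ref):
            if i not in used and abs(e - r) <= tol:
                matched += 1
                used.add(i)             # each reference beat matches once
                break
    precision = matched / len(est)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Fine-tuning a tracker on Carnatic data, as in the study, aims to raise exactly this score on held-out tāla-annotated recordings.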
(2025) Prabhu, Satyajeet

This project investigates contrastive learning techniques for aligning audio and text representations in the music domain, focusing on scenarios with limited data and computational resources. We provide a comprehensive review of existing methods relevant to music-text contrastive learning. Two audio encoders, HTSAT and MAEST, initialized with pretrained weights, are integrated with a frozen RoBERTa text encoder within the LAION-AI CLAP framework and fine-tuned on the MTG-Jamendo dataset. Model performance is evaluated on three tasks: zero-shot genre classification on the GTZAN dataset, multi-label tag classification on the MagnaTagATune dataset, and text-to-music retrieval on the Song Describer dataset. Results show that HTSAT generalizes better in low-data settings, while MAEST tends to overfit, highlighting the impact of encoder complexity in resource-constrained environments. Attempts to mitigate MAEST’s overfitting with weight decay and learning rate decay were unsuccessful. Additionally, the study highlights the critical role of data volume and batch size in contrastive learning effectiveness. The source code for this work is publicly available at https://github.com/SerX610/smc-master-thesis
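The CLAP-style training objective underlying this alignment is a symmetric InfoNCE-style contrastive loss over a batch similarity matrix. A hedged, dependency-free sketch (one direction only; the temperature value and toy similarity matrices are illustrative assumptions):

```python
# Sketch of an InfoNCE-style contrastive objective as used in CLAP-style
# audio-text alignment: matched pairs sit on the diagonal of a similarity
# matrix and are treated as the correct class in a cross-entropy.
import math

def info_nce(sim, temperature=0.07):
    # sim[i][j]: similarity of audio i with text j; diagonal = matched pairs.
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)                               # stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)                  # pull pair i together
    return loss / n

# A well-aligned batch (strong diagonal) scores a lower loss than a
# confused one, which is what fine-tuning drives toward.
aligned = [[0.9, 0.1], [0.1, 0.9]]
confused = [[0.1, 0.9], [0.9, 0.1]]
```

This per-batch formulation also makes the abstract's point about batch size concrete: each row is contrasted against all other texts in the batch, so larger batches supply more negatives per update.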
(2025) Cárdenas Gracia, Sergio



