Recent Submissions

Open-Domain Zero-Shot Audio Tagging: Evaluation via Semantic Embeddings (2025). Yapici, Tolga. This thesis investigates open-domain zero-shot audio tagging on the BSD10k dataset, a curated heterogeneous subset of Freesound, using Contrastive Language–Audio Pretraining (CLAP) audio embeddings. To reduce the impact of rare and noisy labels, we apply a document frequency (DF) weighting scheme, which leads to substantial performance gains. We further introduce a semantic evaluation approach based on SBERT text embeddings, which captures semantically valid tags missed by exact string matching. This yields notable gains across systems, with the largest improvements in the baseline model and consistent improvements for both the DF-weighted variant and Freesound's supervised tag recommender used for comparison. Together, the tag weighting and semantic evaluation demonstrate performance improvements beyond standard metrics. While the results show clear advances, zero-shot tagging with CLAP remains limited by incomplete generalization to folksonomy labels and sparse annotation coverage. Nevertheless, this work highlights the potential of zero-shot approaches to enable consistent and standardized audio annotation directly from raw audio.
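To make the semantic evaluation idea concrete, here is a minimal sketch of SBERT-based tag matching, assuming the sentence-transformers library with the all-MiniLM-L6-v2 model and an illustrative similarity threshold; the thesis's actual SBERT model and threshold may differ.

```python
# Semantic tag evaluation: count a predicted tag as correct if its SBERT
# embedding is close enough to any ground-truth tag, instead of requiring
# an exact string match. Model name and threshold are illustrative.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any SBERT model works here

def semantic_hits(predicted, ground_truth, threshold=0.7):
    """Return predicted tags whose best cosine similarity to any
    ground-truth tag exceeds the threshold."""
    pred_emb = model.encode(predicted, convert_to_tensor=True)
    gt_emb = model.encode(ground_truth, convert_to_tensor=True)
    sims = cos_sim(pred_emb, gt_emb)    # (n_pred, n_gt) similarity matrix
    best = sims.max(dim=1).values       # best match per predicted tag
    return [t for t, s in zip(predicted, best) if s >= threshold]

# Example: "vehicle" has no exact string match but is semantically close to "car".
print(semantic_hits(["vehicle", "rain"], ["car", "traffic", "engine"]))
```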
SurpriseLSTM: Neural Modeling of Musical Expectation and Surprise in Monophonic Melodies (2025). Lissenko, Tanguy. We introduce SurpriseLSTM, an LSTM network that predicts note-by-note surprisal in strictly monophonic, symbolic melodies. We conduct the first systematic comparisons of a neural expectancy model against the symbolic IDyOM model and the audio-based AudioIC algorithm across complementary validation paradigms: large-scale correlations on Western melody corpora, two-note pleasantness ratings, Bach-chorale surprise profiles, and the musical Wundt effect. SurpriseLSTM matches or exceeds the performance of IDyOM in aligning with human surprise judgments. A detailed context analysis reveals distinct differences in how neural and statistical models respond to authentic cadences versus non-cadential contexts, with SurpriseLSTM showing superior containment rates and distribution fitting in predicting human scale-degree expectations. These findings demonstrate that recurrent neural networks, trained on basic note-level features, can accurately capture the cognitive principles of statistical learning and probabilistic prediction in music perception. Code and a pre-trained model are available at https://github.com/lissenko/surprise-lstm
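The quantity such a model predicts is surprisal, i.e. $-\log_2 p(\text{note}_t \mid \text{notes}_{<t})$ under an autoregressive model. Below is a minimal PyTorch sketch of computing per-note surprisal from an LSTM; vocabulary size and architecture are illustrative, not SurpriseLSTM's actual configuration.

```python
# Note-by-note surprisal from an autoregressive LSTM over pitch tokens.
import torch
import torch.nn as nn

class MelodyLSTM(nn.Module):
    def __init__(self, vocab=128, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens):              # tokens: (batch, time)
        h, _ = self.lstm(self.embed(tokens))
        return self.head(h)                 # logits: (batch, time, vocab)

def surprisal(model, melody):
    """Per-note surprisal in bits for a 1-D tensor of pitch tokens."""
    with torch.no_grad():
        logits = model(melody[:-1].unsqueeze(0))      # predict notes 1..T-1
        logp = torch.log_softmax(logits, dim=-1)
        target = melody[1:].unsqueeze(0).unsqueeze(-1)
        nll = -logp.gather(-1, target).squeeze()      # negative log-likelihood (nats)
    return nll / torch.log(torch.tensor(2.0))         # convert to bits

model = MelodyLSTM()
melody = torch.tensor([60, 62, 64, 65, 67])  # C major fragment as MIDI pitches
print(surprisal(model, melody))              # high values = surprising notes
```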
Understanding Audio Source Separation in Carnatic Music with Multimodal Data (2025). Fuhrmann, Théo. This thesis investigates the internal mechanisms of audio-visual source separation models, focusing on how performer motion guides source separation in Carnatic music. To move beyond "black box" performance metrics, we employ a dual approach combining model-independent analysis of audio-visual synchrony with model-specific interpretability techniques applied to Voice-Vision Transformer (VoViT) architectures. A cross-correlation and sliding-window analysis on the Saraga dataset first establishes instrument-specific temporal patterns, revealing that targeted instrument-specific motion features exhibit stronger and more consistent correlations with audio dynamics than general body motion. While global linear audio-visual synchrony is weak, these analyses highlight the importance of localised and instrument-specific motion cues. Subsequently, we apply gradient-based saliency methods to a vocal separation model, demonstrating its primary reliance on facial keypoints. Ablation studies causally confirm that these facial regions are crucial for vocal separation, while body motion contributes minimally. We further analyze the attention mechanisms of a vocal and violin model to understand how it disentangles spectrally overlapping sources, specifically through a revised hybrid fusion architecture. FiLM (Feature-wise Linear Modulation) analysis reveals that visual information causally modulates audio features by consistently amplifying or suppressing them, acting as a dynamic gating mechanism. This research provides a framework for interpreting gesture-based audio-visual systems, offers novel insights into audio-visual learning, and contributes to the development of more transparent and culturally aware Music Information Retrieval technologies. Code available at: https://github.com/theofuhrmann/masters-thesis
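The FiLM mechanism analysed above is compact enough to sketch directly: visual features predict a per-channel scale (gamma) and shift (beta) applied to the audio features. Dimensions below are illustrative, not VoViT's actual sizes.

```python
# FiLM (Feature-wise Linear Modulation): visual features produce per-channel
# scale (gamma) and shift (beta) that gate audio features.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, visual_dim=256, audio_channels=64):
        super().__init__()
        # One linear layer predicts both gamma and beta from the visual embedding.
        self.proj = nn.Linear(visual_dim, 2 * audio_channels)

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, channels, time), visual_feat: (batch, visual_dim)
        gamma, beta = self.proj(visual_feat).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1)     # broadcast over time
        beta = beta.unsqueeze(-1)
        # gamma > 1 amplifies a channel, 0 < gamma < 1 suppresses it: the
        # dynamic gating behaviour described in the abstract.
        return gamma * audio_feat + beta

film = FiLM()
audio = torch.randn(2, 64, 100)
visual = torch.randn(2, 256)
print(film(audio, visual).shape)  # torch.Size([2, 64, 100])
```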
“Revisiting Meter Tracking in Carnatic Music using Deep Learning Approaches” (2025). Prabhu, Satyajeet. Beat and downbeat tracking, jointly referred to as meter tracking, is a fundamental task in Music Information Retrieval (MIR). Deep learning models have far surpassed traditional signal processing and classical machine learning approaches in this domain, particularly for Western (Eurogenetic) genres, where large annotated datasets are widely available. These systems, however, perform less reliably on underrepresented musical traditions. Carnatic music, a rich tradition from the Indian subcontinent, is renowned for its rhythmic intricacy and unique metrical structures (tālas). The most notable prior work on meter tracking in this context employed probabilistic Dynamic Bayesian Networks (DBNs). The performance of state-of-the-art (SOTA) deep learning models on Carnatic music, however, remains largely unexplored. In this study, we evaluate two models for meter tracking in Carnatic music: the Temporal Convolutional Network (TCN), a lightweight architecture that has been successfully adapted for Latin rhythms, and Beat This!, a transformer-based model designed for broad stylistic coverage without the need for post-processing. Replicating the experimental setup of the DBN baseline on the Carnatic Music Rhythm (CMRf) dataset, we systematically assess the performance of these models in a directly comparable setting. We further investigate adaptation strategies, including fine-tuning the models on Carnatic data and the use of musically informed parameters. Results show that while off-the-shelf models do not always outperform the DBN, their performance improves substantially with transfer learning, matching or surpassing the baseline. These findings indicate that SOTA deep learning models can be effectively adapted to underrepresented traditions, paving the way for more inclusive and broadly applicable meter tracking systems.
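The transfer-learning step that drives the reported gains follows a standard recipe: load a model pre-trained on Western data and continue training on Carnatic excerpts at a low learning rate. Below is a generic PyTorch sketch; the model and data loader are hypothetical stand-ins, not the actual TCN or Beat This! codebases.

```python
# Generic fine-tuning loop for a pretrained beat/downbeat tracker.
import torch

def fine_tune(model, carnatic_loader, epochs=20, lr=1e-4):
    """Fine-tune a pretrained meter-tracking model on Carnatic data.

    model: any nn.Module mapping spectrograms to frame-wise beat logits
    carnatic_loader: yields (spectrogram, beat_targets) pairs
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # low LR for transfer
    criterion = torch.nn.BCEWithLogitsLoss()  # frame-wise beat activation target
    model.train()
    for _ in range(epochs):
        for spectrogram, beat_targets in carnatic_loader:
            optimizer.zero_grad()
            activations = model(spectrogram)   # (batch, frames) beat logits
            loss = criterion(activations, beat_targets)
            loss.backward()
            optimizer.step()
    return model
```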
Comparison of Audio Encoders for Audio-Text Contrastive Learning Representations (2025). Cárdenas Gracia, Sergio. This project investigates contrastive learning techniques for aligning audio and text representations in the music domain, focusing on scenarios with limited data and computational resources. We provide a comprehensive review of existing methods relevant to music-text contrastive learning. Two audio encoders, HTSAT and MAEST, initialized with pretrained weights, are integrated with a frozen RoBERTa text encoder within the LAION-AI CLAP framework and fine-tuned on the MTG-Jamendo dataset. Model performance is evaluated on three tasks: zero-shot genre classification on the GTZAN dataset, multi-label tag classification on the MagnaTagATune dataset, and text-to-music retrieval on the Song Describer dataset. Results show that HTSAT generalizes better in low-data settings, while MAEST tends to overfit, highlighting the impact of encoder complexity in resource-constrained environments. Attempts to mitigate MAEST’s overfitting with weight decay and learning rate decay were unsuccessful. Additionally, the study highlights the critical role of data volume and batch size in contrastive learning effectiveness. The source code for this work is publicly available at https://github.com/SerX610/smc-master-thesis
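Zero-shot classification in the CLAP framework reduces to comparing an audio embedding against text-prompt embeddings. The sketch below uses the LAION-CLAP package's documented entry points with its default checkpoint; the thesis's fine-tuned HTSAT/MAEST checkpoints would be loaded instead, and "track.wav" is a placeholder path.

```python
# Zero-shot genre classification with CLAP: embed the audio and one text
# prompt per genre, then pick the genre with the highest cosine similarity.
import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # default pretrained checkpoint (assumption: defaults suffice)

genres = ["blues", "classical", "jazz", "metal", "reggae"]  # GTZAN-style labels
prompts = [f"This audio is a {g} song." for g in genres]

text_emb = model.get_text_embedding(prompts)                          # (n_genres, d)
audio_emb = model.get_audio_embedding_from_filelist(x=["track.wav"])  # (1, d)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(audio_emb) @ normalize(text_emb).T  # cosine similarities
print(genres[int(np.argmax(scores))])                  # predicted genre
```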
Neural Engine Sound Synthesis with Physics-Informed Inductive Biases and Differentiable Signal Processing (2025). Doerfler, Robin. Engine sound synthesis is increasingly important in automotive audio and interactive media, yet presents unique challenges for neural audio generation that distinguish it from musical audio paradigms. Unlike sustained musical tones, where periodic oscillations exist inherently in the acoustic vibration, engine sounds emerge from sequential combustion events that generate sharp pressure transients recurring at rates from 600 to over 8000 RPM. This creates acoustic phenomena exhibiting significant inharmonicity, extremely low fundamental frequencies (down to 5 Hz), and rapid temporal sequences with intervals below 2 milliseconds, demanding approaches that can model both precision in timing and complexity in timbral evolution, beyond conventional musical audio assumptions. While existing differentiable digital signal processing (DDSP) methods have demonstrated success across various audio synthesis tasks, they often rely on generic synthesis modules that do not explicitly recognize or incorporate the acoustic principles and physical mechanisms underlying engine sounds. This thesis presents a novel approach to engine sound synthesis through systematic integration of physics-informed inductive biases within the entire differentiable synthesis pipeline. It proposes the Procedural Engines Model (PRCE), a deep learning architecture that combines time-varying embeddings of RPM and torque parameters (including their temporal derivatives) and derived conditioning signals (throttle position and deceleration fuel cutoff, DFCO) with specialized model heads for physics-informed parameter conversion, driving two custom differentiable synthesizer configurations that incorporate domain-specific acoustic principles. To guide learning toward accurate engine timbre reproduction, a custom loss function is introduced that prioritizes spectral energy near engine-order harmonics, drawing inspiration from Campbell diagrams commonly used in noise, vibration, and harshness (NVH) analysis. Engine sounds present a fundamental duality: while in reality a sum of structured noise-like pressure pulses, they manifest as distinctly harmonic acoustic phenomena. This motivates two complementary synthesis strategies that provide contrasting optimization pathways toward the same acoustic target: direct spectral-temporal reconstruction that implicitly reflects the underlying pulse structure, and explicit pulse-sequence modeling through acoustic simulation of individual combustion events, their temporal alignment, and exhaust-system propagation. The PRCE framework implements both perspectives as two configurations. The Harmonic-Plus-Noise (HPN) variant employs modified harmonic synthesis with systematic inharmonicity and temporal-spectral structuring of noise components to model observable acoustic characteristics. The Pulse-Train-Resonator (PTR) configuration directly models physical-acoustic phenomena by composing combustion pulses aligned to engine firing patterns and propagating them through differentiable resonator networks simulating exhaust acoustics. Evaluation on procedurally generated engine sound datasets totaling 2.5 hours across varied operating conditions reveals complementary strengths between the synthesis approaches. PTR achieves modestly superior validation performance (a 5.7% improvement in total loss) and demonstrates more consistent training-validation transfer, while HPN shows greater flexibility across diverse engine configurations and robustness to harmonic irregularities. Both variants successfully capture authentic engine acoustic behaviors despite distinct synthesis strategies and their audible signatures. This research demonstrates systematic integration of physics-informed inductive biases into differentiable synthesis architectures, providing a methodological framework applicable to physically constrained audio generation beyond automotive contexts. The work reveals that domain-specific biases produce distinct acoustic signatures that influence both optimization strategies and perceptual outcomes. To support future research, we openly publish the Procedural Engines Dataset, a comprehensive collection of procedurally generated engine audio with time-aligned control annotations, and the complete PRCE model pipeline.
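A worked example of the physical reasoning behind the PTR excitation: in a four-stroke engine each cylinder fires every second revolution, so the overall firing rate is rpm / 60 × (cylinders / 2) Hz. The minimal pulse-train sketch below uses that relation; the rectangular pulse shape and constant RPM are simplifications, not PRCE's actual combustion-pulse model.

```python
# Impulse train at the firing rate of a four-stroke engine: the excitation
# signal that resonator networks would then filter into exhaust sound.
import numpy as np

def pulse_train(rpm, cylinders=4, duration=1.0, sr=48000):
    """Unit impulses spaced at the engine firing period."""
    firing_hz = rpm / 60.0 * (cylinders / 2.0)  # four-stroke: fire every 2nd rev
    n = int(duration * sr)
    signal = np.zeros(n)
    period = int(sr / firing_hz)   # samples between combustion events
    signal[::period] = 1.0
    return signal

# At 600 RPM a 4-cylinder four-stroke fires 20 times per second, while each
# individual cylinder fires at only 5 Hz: the very low fundamentals noted above.
print(pulse_train(600).sum())  # ~20 pulses in one second
```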
InScoreAI: Collaborative Score Inpainting with Anticipatory Transformers (2025). Lallana Babiloni, Manuel. Collaborative music composition is the process by which multiple individuals contribute to the creation of a musical work. In this project, an interface is created to help composers work collaboratively and reach consensus through flexible symbolic music generation using Anticipatory Transformers. The model generates MIDI inside a specific fragment of music, taking into account the preceding and following tokens. Musicians and composers are the central focus of the project; they informed every stage, from initial design through development and evaluation. A preliminary survey of experienced musicians and composers was conducted to identify important features for the interface. The evaluation consisted of three tasks: one using the interface without the AI tools, one with the AI tools, and one using the collaborative mode with AI tools. Results show that AI assistance significantly reduces compositional effort and improves self-efficacy in individual workflows, but diminishes perceived ownership. In collaborative settings, the system effectively resolves interpersonal friction through AI-mediated "idea bridging," fostering consensus and mutual understanding; the sense of ownership is higher than in the individual AI setting, though open-ended responses show it remains a concern for some participants. Professional composers voiced concerns about AI making composers "lazy" and the resulting works simple and predictable, while non-professionals highlighted educational potential for collaborative music settings.
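The infilling workflow the interface builds on can be sketched as conditioning on the tokens both before and after the selected fragment. The `model` object and its `generate` signature below are hypothetical stand-ins, not the actual Anticipatory Transformer API.

```python
# Hypothetical score-inpainting helper: regenerate a selected fragment while
# conditioning on both the preceding and the following (anticipated) context.
def infill_fragment(model, midi_tokens, start, end, **sampling_kwargs):
    """Regenerate midi_tokens[start:end] conditioned on both sides."""
    prefix = midi_tokens[:start]   # events before the edited fragment
    suffix = midi_tokens[end:]     # events after it, supplied as anticipated context
    new_middle = model.generate(prefix=prefix, suffix=suffix,
                                length=end - start, **sampling_kwargs)
    return prefix + new_middle + suffix
```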
MuSA: A New TEL Platform for Enhancing Self-Reflection and Musical Understanding through Saliency Analysis of Performance Recordings (2025). Oktay, Isabelle. This thesis explores how a technology-enhanced learning (TEL) tool can improve music practice by addressing a critical, often-neglected component of skill development: the reflection phase. It focuses on the development and evaluation of MuSA (Musical Salience Analyzer), an application designed to provide a pedagogically grounded platform for analyzing recorded performances to make reflection more efficient and effective. MuSA’s design is informed by key educational theories, including the Talent-Development-in-Achievement-Domains (TAD) Music Model and learner-centered teaching (LCT) principles like scaffolding, self-regulated learning (SRL), and self-directed learning (SDL). Its central feature is saliency analysis, which algorithmically identifies key moments in a performance based on variability in musical features such as pitch, dynamics, and tempo. Unlike tools that offer prescriptive, "correct/incorrect" feedback, MuSA encourages a learner’s own interpretation. As an accessible, web-based platform, it allows users to upload or record audio for analysis independently of a teacher. To evaluate MuSA’s effectiveness, a mixed-methods, within-subjects study was conducted with 14 participants. While the study’s small sample size limited statistical power, the findings pointed to several exploratory trends. The data suggested a differential impact based on musical feature and experience level, with dynamics showing the most consistent trend toward objective and perceived improvement. The analysis also suggested a potential expertise reversal effect, where trends showed intermediate musicians gaining from the feedback while advanced musicians experienced neutral or slightly negative changes. Furthermore, the study’s self-awareness metrics indicated a general misalignment between participants’ self-ratings and objective performance, highlighting a core challenge in the self-reflection phase of independent practice. In conclusion, MuSA offers a potential contribution to TEL for music by leveraging computational analysis to provide targeted insights that can scaffold the reflective process. Although the quantitative results were inconclusive, positive qualitative feedback validates the demand for such a tool. This work provides a functional prototype and a research infrastructure for collecting labeled recording data, demonstrating the dual role of MuSA as both a learning support system and a potential research tool.
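The core of the saliency analysis can be illustrated as local variability over an extracted feature curve. The window size, feature, and top-k selection below are illustrative choices, not MuSA's actual parameters.

```python
# Flag the moments where a musical feature (pitch, dynamics, tempo) varies
# most within a local window: candidate "key moments" for a learner to review.
import numpy as np

def salient_frames(feature, window=50, top_k=5):
    """Return indices of the top-k frames by local variability of a feature curve."""
    half = window // 2
    padded = np.pad(feature, half, mode="edge")   # avoid edge artifacts
    variability = np.array([padded[i:i + window].std()
                            for i in range(len(feature))])
    return np.argsort(variability)[-top_k:][::-1]  # most variable first

loudness = np.random.rand(1000)   # stand-in for an extracted dynamics curve
print(salient_frames(loudness))   # frames a learner might reflect on first
```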
Interactive machine learning for music classification (2025). Alexandrovich Danilin, Danila. Audio embeddings are a promising approach to music representation, thanks in part to their ability to extract complex patterns from audio data. In this thesis, their predictive power is used to build a semantically meaningful, two-dimensional visualization of music data in a user interface (UI) developed as part of the research. As a contribution to ongoing research at the intersection of music information retrieval (MIR) and interactive machine learning (IML), the UI allows users to iteratively train a classifier for numerous audio classification tasks. We propose two heuristics: the certainty-based class prediction uncertainty (CPU) heuristic and the dataset coverage (DC) heuristic. These heuristics are shown to identify informative samples in music collections, and their efficiency is objectively evaluated by means of simulated, iterative active learning (AL) classification tasks on six different embedding-dataset pairs. The objective evaluations show promising results: high classification accuracies are achieved in fewer iterations of the AL classification tasks.
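A least-confidence reading of the CPU heuristic can be sketched in a few lines: at each AL iteration, surface the pool samples whose predicted class has the lowest probability. This is a generic sketch of certainty-based uncertainty sampling, not the thesis's implementation.

```python
# Certainty-based sample selection for active learning: ask the user to label
# the tracks whose current class prediction is least certain.
import numpy as np

def least_confident(probabilities, k=10):
    """Pick the k unlabeled samples with the lowest top-class probability.

    probabilities: (n_samples, n_classes) classifier outputs on the
    unlabeled pool (e.g. predictions over a music collection's embeddings).
    """
    confidence = probabilities.max(axis=1)   # certainty of the predicted class
    return np.argsort(confidence)[:k]        # least certain first

pool_probs = np.random.dirichlet(np.ones(4), size=200)  # mock 4-class outputs
print(least_confident(pool_probs))  # indices to present for labeling next
```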
Real-time Generation of Percussive Rhythms Using Descriptors (2025). Vilanova, Alexandre. A fundamental challenge in computational music generation lies in developing control interfaces that provide intuitive, musically meaningful interactions with generative systems. This thesis addresses this challenge specifically for rhythmic generation, focusing on the development of a system capable of generating 16-step monophonic rhythmic patterns in real time using musically intuitive controls. Our method uses perceptually grounded rhythmic descriptors as an expressive, intuitive control space. A neural network is trained on all possible binary 16-step monophonic patterns, learning to map from descriptor space back to rhythmic patterns. We compare this descriptor-based approach to a variational autoencoder model and find the former more effective for usability and expressive control. An interactive interface is developed for exploration and testing, followed by quantitative and qualitative experiments evaluating the smoothness and user intuitiveness of the system. Findings show that the descriptor-based model aligns well with listener perception, balancing usability with expressive flexibility. While limited to monophonic rhythms, the system establishes descriptors as a strong foundation for extending interactive rhythm generation to polyphonic and more complex domains.
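Two descriptors of the kind that could populate such a control space are onset density and syncopation against a metrical weight template. The simplified Longuet-Higgins/Lee-style measure below is illustrative; the thesis's exact descriptor set may differ.

```python
# Simple rhythmic descriptors for a binary 16-step pattern.
METRICAL_WEIGHTS = [0, -4, -3, -4, -2, -4, -3, -4,
                    -1, -4, -3, -4, -2, -4, -3, -4]  # strong to weak positions in 4/4

def density(pattern):
    """Fraction of the 16 steps that contain an onset."""
    return sum(pattern) / len(pattern)

def syncopation(pattern):
    """Sum of weight gaps for onsets followed by a rest on a stronger position."""
    total = 0
    for i, hit in enumerate(pattern):
        nxt = (i + 1) % len(pattern)   # wrap around the looped bar
        if hit and not pattern[nxt] and METRICAL_WEIGHTS[nxt] > METRICAL_WEIGHTS[i]:
            total += METRICAL_WEIGHTS[nxt] - METRICAL_WEIGHTS[i]
    return total

pattern = [1,0,0,1, 0,0,1,0, 0,1,0,0, 1,0,1,0]  # a syncopated 16-step rhythm
print(density(pattern), syncopation(pattern))   # 0.375 3
```

A network of the kind described would learn the inverse mapping, from a point in this descriptor space back to a concrete 16-step pattern.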
Improving the Semantic Structure of Neural Audio Codecs (2025). Monsalve Fernández, Ángel. Neural audio codecs have achieved remarkable compression efficiency by learning latent representations optimized for waveform fidelity. However, these codecs often lack explicit semantic structure, limiting their effectiveness for downstream tasks that require meaningful audio abstractions. Query-based compression, as introduced by ALMTokenizer, offers a path to infuse global context into discrete audio tokens by interleaving learnable [CLS] embeddings among frame-level features and leveraging Transformer attention to aggregate semantic information. This thesis implements a reproducible pipeline that adapts the ALMTokenizer paradigm using a frozen EnCodec front-end. By inserting one [CLS] query token every w frames, the model enables bitrate-on-demand through a tunable window length, while a Transformer encoder–decoder architecture captures long-range dependencies and reconstructs waveforms via a paired decoder. Quantization layers are omitted in this implementation to focus analysis on the raw contextual embeddings. To assess the semantic organization of the resulting latent space, we extract [CLS] embeddings from the Good-sounds dataset and perform an evaluation of the resulting latents. Our analyses show that although ALMTokenizer reconstructions lag behind EnCodec in perceptual quality, its embeddings exhibit stronger semantic organization. Clustering, projection, and classification experiments reveal clearer groupings by instrument, note, and octave, while interpolation suggests smoother latent transitions. This highlights a trade-off: EnCodec excels at fidelity, whereas ALMTokenizer provides embeddings better suited for semantic tasks. By releasing the implementation and methodology, this thesis offers a foundation for future research on semantically structured audio codecs.
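The query-based compression step is easy to sketch: a learnable [CLS] embedding is inserted before every window of w frames so that Transformer attention can pool global context into those slots. A smaller w means more query tokens and hence a higher bitrate, which is the bitrate-on-demand mechanism described above. Dimensions and w below are illustrative, not the thesis's exact settings.

```python
# Interleave a shared learnable [CLS] query before every window of w frames.
import torch
import torch.nn as nn

class ClsInterleaver(nn.Module):
    def __init__(self, dim=128, w=4):
        super().__init__()
        self.w = w
        self.cls = nn.Parameter(torch.randn(1, 1, dim))  # shared learnable query

    def forward(self, frames):                  # frames: (batch, time, dim)
        b, t, d = frames.shape
        chunks = []
        for start in range(0, t, self.w):
            chunks.append(self.cls.expand(b, 1, d))        # one [CLS] per window
            chunks.append(frames[:, start:start + self.w])
        # Sequence becomes [CLS], w frames, [CLS], w frames, ... ready for a
        # Transformer whose attention aggregates context into the [CLS] slots.
        return torch.cat(chunks, dim=1)

x = torch.randn(2, 16, 128)          # 16 EnCodec-style frame features
tokens = ClsInterleaver().forward(x)
print(tokens.shape)                  # torch.Size([2, 20, 128]): 4 queries added
```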
