Randomly Weighted CNNs for (Music) Audio Classification

The computer vision literature shows that randomly weighted neural networks perform reasonably as feature extractors. Following this idea, we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks. We use features extracted from the embeddings of deep architectures as input to a classifier – with the goal to compare classification accuracies when using different randomly weighted architectures. By following this methodology, we run a comprehensive evaluation of the current architectures for audio classification, and provide evidence that the architectures alone are an important piece for resolving (music) audio problems using deep neural networks.


MOTIVATION FROM PREVIOUS WORKS
Some intriguing properties of deep neural networks periodically show up in the scientific literature. Examples are: (i) perceptually irrelevant signal perturbations that dramatically affect the predictions of an image classifier [12,49]; (ii) image classification models that perform well on unseen data, although there is no guarantee of converging to a global minimum that generalizes [14,25]; and (iii) non-trained deep neural networks that perform reasonably well as image feature extractors [41,43,51]. In this work, we exploit property (iii) to evaluate how discriminative deep audio architectures are before training. Previous works already explored the idea of empirically studying the qualities of non-trained (randomly weighted) networks, but mainly in the computer vision field:
· Saxe et al. [43] studied how discriminative the architectures themselves are by evaluating the classification performance of SVMs fed with features extracted from non-trained (random) convolutional neural networks (CNNs). They showed that a surprising fraction of the performance of deep image classifiers can be attributed to the architecture alone. Therefore, the key to good performance lies not only in improving the learning algorithms but also in searching for the most suitable architectures. Further, they showed that the (classification) performance delivered by random CNN features is correlated with the results of their end-to-end trained counterparts. In practice, this result means that one can bypass the time-consuming process of learning when evaluating a given architecture. We build on top of this result to evaluate current CNN architectures for audio classification.
· Rosenfeld and Tsotsos [41] fixed most of the model's weights to be random, and only allowed a small portion of them to be learned. By following this methodology, they showed a small decrease in image classification performance when these models were compared to their fully trained counterparts. Further, the performance of their non fully-trained models can be summarized as follows: DenseNet [17] ResNet [15] > VGG [48] AlexNet [25]. This matches previous works reporting how these (fully trained) models perform [15,17,48], confirming the performance correlation between randomly weighted models and their trained counterparts found by Saxe et al. [43].
· Adebayo et al. [1] empirically assessed the local explanations of deep image classifiers and found that randomly weighted models produce explanations similar to those produced by models with learned weights. They conclude that the architectures introduce a strong prior which affects the learned (and not learned) representations.
· Ulyanov et al. [51] also showed that the structure of a network (the non-trained architecture) is sufficient to capture useful features for the tasks of image denoising, super-resolution and inpainting. They think of any designed architecture as a hand-crafted model where prior information is embedded in the structure of the network itself. This way of thinking resonates with the rationale behind the family of audio models designed considering domain knowledge (see section 2). This suggests that both the audio and image fields share an interest in bringing together the end-to-end learning literature and previous research.
Few related works exist in the audio field, and every randomly weighted neural network we found in the audio literature was a mere baseline [2,7,24]. Inspired by previous computer vision works, we study which audio architectures work best by evaluating how non-trained CNNs perform as feature extractors. To this end, we use the CNNs' embeddings to construct feature vectors for a classifier, with the goal of comparing classification performances when different randomly weighted architectures are used to extract features. To the best of our knowledge, this is the first comprehensive evaluation of randomly weighted CNNs for (music) audio classification.
Extreme learning machines (ELMs) [18,32,47] and echo state networks (ESNs) [19] are also closely related to our work. In short, ELMs are classification/regression models² that are based on a single-layer feed-forward neural network with random weights. They work as follows: first, ELMs randomly project the input into a latent space; and then, they learn how to predict the output via a least-squares fit. More formally, we aim to predict $\hat{Y} = W_2\,\sigma(W_1 X)$, where $W_1$ is the (randomly weighted) matrix of input-to-hidden-layer weights, $\sigma$ is the non-linearity, $W_2$ is the matrix of hidden-to-output-layer weights, and $X$ represents the input. The training algorithm is as follows: 1) set $W_1$ with random values; 2) estimate $W_2$ via a least-squares fit, $W_2 = Y\,\sigma(W_1 X)^{+}$, where $Y$ denotes the training targets and $^{+}$ denotes the Moore-Penrose inverse. Since no iterative process is required for learning the weights, training is faster than stochastic gradient descent [18]. Given that we process audio signals with randomly weighted CNNs, ELM-based classifiers are a natural choice for our study, so that the whole pipeline (except the last layer) is based on random projections that are only constrained by the structure of the neural network. Although ELMs are not widely used by the audio community, they have been used for speech emotion recognition [13,21] and for music audio classification [23,29,44].
ESNs differ from ELMs in that their random projections use recurrent connections. Given that the audio models we aim to study are not recurrent, we leave the use of ESNs for future work (see [16,45] for audio applications of ESNs).
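The two-step ELM training procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this study: the ReLU non-linearity, the Gaussian initialization of W1, the column-per-example data layout, and the function names `elm_fit` and `elm_predict` are all assumptions made for the example.

```python
import numpy as np

def elm_fit(X, Y, n_hidden, rng):
    """Fit an ELM. X: (n_features, n_examples), Y: (n_outputs, n_examples)."""
    # Step 1: set W1 with random values (it is never trained).
    W1 = rng.standard_normal((n_hidden, X.shape[0]))
    # Random projection followed by the non-linearity (ReLU is an assumption).
    H = np.maximum(W1 @ X, 0.0)
    # Step 2: least-squares fit of W2 through the Moore-Penrose inverse.
    W2 = Y @ np.linalg.pinv(H)
    return W1, W2

def elm_predict(X, W1, W2):
    """Predict Y_hat = W2 * sigma(W1 * X)."""
    return W2 @ np.maximum(W1 @ X, 0.0)

# Toy two-class problem (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 200))
labels = (X[0] + X[1] > 0).astype(int)
Y = np.eye(2)[labels].T  # one-hot targets, shape (2, 200)

W1, W2 = elm_fit(X, Y, n_hidden=100, rng=rng)
pred = elm_predict(X, W1, W2).argmax(axis=0)
train_acc = (pred == labels).mean()
```

Note that no gradient step appears anywhere: the only learned quantity, W2, is obtained in closed form.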

ARCHITECTURES
In this work we evaluate the most used deep learning architectures for (music) audio classification. To facilitate the discussion around these architectures, we divide the deep learning pipeline into two parts: front-end and back-end, see Figure 2. The front-end is the part that interacts with the input signal in order to map it into a latent space, and the back-end predicts the output given the representation obtained by the front-end. Note that one can interpret the front-end as a "feature extractor" and the back-end as a "classifier". Given that we compare how several non-trained (random) CNNs perform as feature extractors, and that we use out-of-the-box classifiers to predict the classes, this literature review focuses on introducing the main deep learning front-ends for audio classification.
² Support Vector Machines are also classification/regression models.
Front-ends - These are generally composed of CNNs [6,9,38,39,53], since these can encode efficient representations by sharing weights³ along the signal. Figure 1 depicts six different CNN front-end paradigms, which can be divided into two groups depending on the input signal used: waveforms [9,27,53] or spectrograms [6,38,39]. Further, the design of the filters can be either based on domain knowledge or not. For example, one leverages domain knowledge when the frame-level single-shape⁴ front-end for waveforms is designed so that the length of the filter is set to be the same as the window length in an STFT [9]. Similarly, for a spectrogram front-end, vertical filters are used to learn spectral representations [26] and horizontal filters to learn longer temporal cues [46]. Generally, a single filter shape is used in the first CNN layer [6,9,26,46], but some recent works reported performance gains when using several filter shapes in the first layer [5,34,36,38,39,53].
Using many filters promotes a richer feature extraction in the first layer, and facilitates leveraging domain knowledge for designing the filters' shape. For example: a frame-level many-shapes front-end for waveforms can be motivated from a multi-resolution time-frequency transform⁵ perspective, with window sizes varying inversely with frequency [53]; or, since it is known that some patterns in spectrograms occur at different time-frequency scales, one can intuitively incorporate many vertical and/or horizontal filters to efficiently capture those in a spectrogram front-end [34,36,38,39]. As seen, using domain knowledge when designing the models allows one to naturally connect the deep learning literature with previous relevant signal processing work. On the other hand, when domain knowledge is not used, it is common to employ a deep stack of small filters, e.g.: 3×1 in the sample-level front-end used for waveforms [27,40,52], or 3×3 in the small-rectangular filters front-end used for spectrograms [6]. These VGG-like⁶ models make minimal assumptions about the local stationarities of the signal, so that any structure can be learnt by hierarchically combining small-context representations.

METHOD
Our goal is to study which CNN front-ends work best by evaluating how non-trained models perform as feature extractors. Our evaluation is based on the traditional pipeline of feature extraction + classifier. We use the embeddings of non-trained (random) CNNs as features: for every audio clip, we compute the average of each feature map (in every layer) and concatenate these values to construct a feature vector [7]. We set the widely used MFCCs + SVM setup as baseline. The baseline feature vector is constructed from 20 MFCCs, their ∆s and ∆∆s: we compute their mean and standard deviation through time, and the resulting feature vector is of size 120. To allow a fair comparison with the baseline, CNN models have ≈ 120 feature maps, so that the resulting feature vectors have a similar size to the MFCCs vector. Further, we evaluate an alternative configuration with more feature maps (≈3500) to show the potential of this approach. The model descriptions omit the number of filters per layer for simplicity; full implementation details are accessible online.⁸
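As an illustration of how the two kinds of feature vectors described above are assembled, the following NumPy sketch averages each feature map of every layer and concatenates the results, and builds the 120-dimensional baseline from the mean and standard deviation of 60 MFCC-based coefficients. The function names, the array layouts, and the random arrays standing in for CNN activations and MFCCs are assumptions made for the example; computing actual MFCCs is out of its scope.

```python
import numpy as np

def cnn_feature_vector(layer_outputs):
    """Average every feature map of every layer and concatenate the averages.

    layer_outputs: list of arrays shaped (n_feature_maps, ...), one per layer.
    """
    averages = [fm.reshape(fm.shape[0], -1).mean(axis=1) for fm in layer_outputs]
    return np.concatenate(averages)

def baseline_feature_vector(mfcc_stack):
    """Mean and standard deviation through time of the stacked coefficients.

    mfcc_stack: (60, n_frames) = 20 MFCCs + 20 deltas + 20 delta-deltas.
    """
    return np.concatenate([mfcc_stack.mean(axis=1), mfcc_stack.std(axis=1)])

rng = np.random.default_rng(0)
# Stand-ins for the activations of a 3-layer CNN with 40 feature maps per layer.
layers = [rng.random((40, 1000)), rng.random((40, 500)), rng.random((40, 250))]
cnn_vec = cnn_feature_vector(layers)
# Stand-in for 60 MFCC-based coefficients over 1376 frames.
mfcc_vec = baseline_feature_vector(rng.standard_normal((60, 1376)))
```

Both vectors have 120 entries here (3 × 40 averages, and 60 means + 60 standard deviations), which mirrors the size-matching argument made above.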

Features: randomly weighted CNNs
Except for the VGG model that uses ELUs as non-linearities [6,8], the rest use ReLUs [10]. We do not use batch normalization, see discussion in section 3.4. We use waveforms and spectrograms as input to our CNNs.
⁵ The Constant-Q Transform [3] is an example of such a transform.
⁶ VGG: a computer vision model based on a deep stack of 3×3 filters.
Waveform inputs - are of ≈ 29sec (350,000 samples at 12kHz) and the following architectures are under study:
· Sample-level: is based on a stack of 7 blocks, each composed of a 1D-CNN layer (filter length: 3, stride: 1) and a max-pool layer (size: 3, stride: 3), with the exception of the input block, which has no max-pooling and whose 1D-CNN layer has a stride of 3 [27]. Averages to construct the feature vector are computed after every pooling layer, except for the first block, where they are computed after the CNN layer.
· Frame-level many-shapes: consists of a 1D-CNN layer with five filter lengths: 512, 256, 128, 64, 32 [53]. Every filter's stride is 32 and we use same padding, so that feature maps of the same size can be easily concatenated. Note that out of this single 1D-CNN layer, five feature maps (resulting from the different filter length convolutions) are concatenated. For that reason, every feature map needs to have the same (temporal) size. Since this model is single-layered and might be at a clear disadvantage when compared to the sample-level CNN, we increase its depth by adding three more 1D-CNN layers (length: 7, stride: 1), where the last two layers have residual connections, and the penultimate layer's feature map is down-sampled by two (MP x2), see Figure 3. Averages to construct the feature vector are calculated for each feature map after every 1D-CNN layer.
· Frame-level: consists of a 1D-CNN layer with a filter of length 512 [9]. Stride is set to 32 to allow a fair comparison with the frame-level many-shapes architecture. As in frame-level many-shapes, we increase the depth of the model by adding three more 1D-CNN layers.
Spectrogram inputs - are set to be log-mel spectrograms (spectrogram size: 1376×96⁷, being ≈ 29sec of signal). Unlike waveform models, spectrogram architectures use no additional layers to deepen single-layered CNNs because these already deliver a reasonable classification performance. Unless stated otherwise, every CNN layer used for processing spectrograms is set to have stride 1. As for the frame-level many-shapes model, we use same padding when many filter shapes are used in the same layer. The following spectrogram models are studied:
· 7×96: consists of a single 1D-CNN layer with filters of length 7 that convolve through the time axis [9]. As a result: CNN filters are vertical and of shape 7×96. Therefore, these filters can encode spectral (timbral) representations. Averages to construct the feature vector are calculated for each feature map after the 1D-CNN layer.
· 7×86: consists of a single 2D-CNN layer with vertical filters of shape 7×86 [36,39]. Since its vertical shape is smaller than the input (86<96), filters can also convolve through the frequency axis, which can be seen as "pitch shifting" the filter. Consequently, 7×86 filters can encode pitch-invariant timbral representations [36,39]. Further, since the resulting activations can carry pitch-related information, we max-pool the frequency axis to get pitch-invariant features (max-pool shape: 1×11). Averages to construct the feature vector are calculated for each feature map after the max-pool layer.
· Timbral: consists of a single 2D-CNN layer with many vertical filters of shapes: 7×86, 3×86, 1×86, 7×38, 3×38, 1×38, see Figure 4 (top) [11,35,39]. These filters can also convolve through the frequency axis and can therefore encode pitch-invariant representations. Several filter shapes are used to efficiently capture different timbrally relevant time-frequency patterns, e.g.: kick drums (can be captured with 7×38 filters representing sub-band information for a short period of time), or string ensemble instruments (can be captured with 1×86 filters representing spectral patterns spread in the frequency axis). Further, since the resulting activations can carry pitch-related information, we max-pool the frequency axis to get pitch-invariant features (max-pool shapes: 1×11 or 1×59). Averages to construct the feature vector are calculated for each feature map after the max-pool layer.
· Temporal: several 1D-CNN filters (of lengths: 165, 128, 64, 32) operate over an energy envelope obtained by mean-pooling the frequency axis of the spectrogram, see Figure 4 (bottom). By computing the energy envelope in that way, we consider high and low frequencies together while minimizing the computations of the model. Observe that this single-layered 1D-CNN does not operate over a 2D spectrogram, but over a 1D energy envelope. Therefore, no vertical convolutions are performed and only 1D (temporal) convolutions are computed. As seen, domain knowledge can also provide guidance to (effectively) minimize the computations of the model. Averages to construct the feature vector are calculated for each feature map after the CNN layer.
· Timbral+temporal: combines both timbral and temporal CNNs in a single (but wide) layer, see Figure 4 [37]. Averages to construct the feature vector are calculated in the same way as for timbral and temporal architectures.
· VGG: is a computer vision model based on a stack of 5 blocks combining 2D-CNN layers (with small rectangular filters of 3×3) and max-pooling (of shapes: 4×2, 4×3, 5×2, 4×2, 4×4, respectively) [6]. Averages to construct the feature vector are computed after every pooling layer.
For further details about the models under study: the code is accessible online⁸, and a graphical conceptualization of the models is available in Figures 1, 3
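To make the single-layered spectrogram front-ends more concrete, the sketch below computes random features in the spirit of the 7×86 model: random vertical filters slide over time and frequency, a ReLU is applied, the frequency axis is max-pooled, and each feature map is averaged. It is only a sketch under stated assumptions: the Gaussian filter initialization, the absence of padding (which is what yields 96−86+1 = 11 frequency positions, matching the 1×11 max-pool), the function name `random_timbral_features`, and the short dummy input are choices made for the example, not details taken from the released code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def random_timbral_features(spec, n_filters, filter_shape=(7, 86), rng=None):
    """Random-CNN features from a (time, frequency) log-mel spectrogram.

    Returns one average per feature map, i.e. a vector of length n_filters.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Randomly weighted filters that are never trained (Gaussian init assumed).
    filters = rng.standard_normal((n_filters,) + filter_shape)
    # All filter-sized patches: (time positions, freq positions, 7, 86).
    patches = sliding_window_view(spec, filter_shape)
    # Convolution without padding: (n_filters, time positions, freq positions).
    acts = np.einsum("tfij,kij->ktf", patches, filters)
    acts = np.maximum(acts, 0.0)  # ReLU
    # Max-pool over the whole remaining frequency axis (1x11 for 96 mel bands).
    pooled = acts.max(axis=2)
    # Average each feature map through time to build the feature vector.
    return pooled.mean(axis=1)

# Short dummy clip with 96 mel bands; the inputs described above are 1376x96.
spec = np.random.default_rng(0).standard_normal((128, 96))
feats = random_timbral_features(spec, n_filters=8, rng=np.random.default_rng(1))
n_freq_positions = sliding_window_view(spec, (7, 86)).shape[1]
```

Calling the same function with `filter_shape=(7, 38)` leaves 96−38+1 = 59 frequency positions, which is consistent with the 1×59 max-pool listed for the timbral model.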

Classifiers: SVM and ELM
We study how several feature vectors (computed considering different CNNs) perform for a given set of classifiers: SVMs and ELMs. We discarded the use of other classifiers since their performance was not competitive when compared to those. SVMs and ELMs are hyper-parameter sensitive; for that reason we perform a grid search:
· SVM hyper-parameters: we consider both linear and rbf kernels. For the rbf kernel, we set γ to: 2^-3, 2^-5, 2^-7, 2^-9, 2^-11, 2^-13, (#features)^-1; and for every kernel configuration, we try several C's (penalty parameter): 0.1, 2, 8, 32. We use scikit-learn's SVM implementation [33].
· Extended Ballroom [4,30]: 4,180 songs divided into 13 classes; 10 stratified folds are randomly generated for cross-validation. We use this dataset to study how randomly weighted CNNs classify rhythm/tempo classes.
· Urban Sound 8K [42]: 8,732 acoustic events divided into 10 classes; 10 folds are already defined by the dataset authors for cross-validation. Since urban sounds are shorter than 4 seconds and our models accept ≈ 29sec inputs, the signal is repeated to create inputs of the same length. We use this dataset to study how randomly weighted CNNs perform when classifying natural (non-music) sounds.
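Two details of the setup above lend themselves to a short sketch: the SVM hyper-parameter grid, and the repetition of short clips up to the input length of the models. The grid is written as a list of dictionaries, a layout that scikit-learn's `GridSearchCV` accepts as `param_grid`. The function names, and the choice of truncating the last repetition so that the output has exactly 350,000 samples, are assumptions made for the example.

```python
import numpy as np

def svm_param_grid(n_features):
    """SVM grid described above: linear and rbf kernels, several C and gamma."""
    Cs = [0.1, 2, 8, 32]
    gammas = [2.0 ** -e for e in (3, 5, 7, 9, 11, 13)] + [1.0 / n_features]
    return [
        {"kernel": ["linear"], "C": Cs},
        {"kernel": ["rbf"], "gamma": gammas, "C": Cs},
    ]

def repeat_to_length(signal, target_len=350_000):
    """Repeat a short signal until it fills target_len samples."""
    reps = -(-target_len // len(signal))  # ceiling division
    # Truncating the last repetition is an assumption of this sketch.
    return np.tile(signal, reps)[:target_len]

grid = svm_param_grid(n_features=120)
# Number of configurations: 4 (linear) + 7 gammas x 4 Cs (rbf).
n_configs = sum(len(g["C"]) * len(g.get("gamma", [None])) for g in grid)

clip = np.arange(48_000, dtype=np.float32)  # dummy 4 s clip at 12 kHz
padded = repeat_to_length(clip)
```

With this grid, 32 SVM configurations are evaluated per feature type.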

Reproducing former results to discuss our method
Choi et al. [7] used random CNN features as a baseline for their work, and found that (in most cases) these random CNN features perform better than MFCCs. Motivated by this result, we pursue this idea to study how different deep architectures perform when resolving audio problems. To start, we reproduce one of their experiments using random CNNs under the same conditions¹⁰: the GTZAN dataset is split into 10 stratified folds used for cross-validation¹¹, a VGG architecture with batch normalization is employed, and the classifier is an SVM. We found that results can vary depending on the batch size if, when computing the feature vectors, layers are normalized with the statistics of the current batch (batch normalization). For example: if audio features of the same genre are batch-normalized by the same factor, one can create an artificial genre cue that might affect the results. One can observe this phenomenon in Figure 5, where the best results are achieved when all songs of the same genre fill a full batch (batch size of 100).¹² We also observe that small batch sizes harm the model's performance, see Figure 5 when batch sizes are set to 1 and 10. Finally, when batch normalization is not used, the results remain the same no matter what batch size we use (ANOVA test with H0 being that averages are equal, p-value=0.491). Since it is not desirable that performance depends on the batch size, nor that the feature vector of an audio clip depends on the other clips present in the batch, we decided not to use batch normalization.
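The dependence on batch composition discussed above can be reproduced with a toy example. In the following NumPy sketch, the same clip is normalized with the statistics of two different batches and ends up with two different representations. This is a simplified illustration that normalizes a small feature matrix over the batch axis only; the function name `batch_normalize`, the value of `eps`, and the synthetic data are assumptions, and the sketch does not reproduce the VGG experiment.

```python
import numpy as np

def batch_normalize(features, eps=1e-5):
    """Normalize (batch, n_features) with the statistics of the current batch."""
    mean = features.mean(axis=0)
    var = features.var(axis=0)
    return (features - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
clip = rng.standard_normal((1, 4))  # features of one audio clip

# The same clip placed in two batches with different companions.
batch_a = np.vstack([clip, rng.standard_normal((9, 4))])
batch_b = np.vstack([clip, rng.standard_normal((9, 4)) + 3.0])

clip_in_a = batch_normalize(batch_a)[0]
clip_in_b = batch_normalize(batch_b)[0]

# The representation of the clip changes with the rest of the batch.
depends_on_batch = not np.allclose(clip_in_a, clip_in_b)
```

Without the normalization step, the representation of `clip` is identical in both batches, which is in line with the batch-size independence reported above when batch normalization is not used.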

Figures show average accuracies across 3 runs for every feature type (listed on the right with the length of the feature vector in parentheses). We use a t-test to reveal which models perform best (H0: averages are equal).
The sample-level waveform model always performs better than frame-level many-shapes (t-test: p-value 0.05). The two best performing spectrogram-based models are timbral+temporal and VGG, with a remarkable performance of the timbral model alone. The timbral+temporal CNN performs better than VGG when using the ELM (≈3500) classifier (t-test: p-value=0.017); but in other cases, both models perform equivalently (t-test: p-value>0.05). Moreover, the 7x86 model performs better than 7x96 when using SVMs (t-test: p-value<0.05), but when using ELMs, 7x96 and 7x86 perform equivalently (t-test: p-value 0.05). The best VGG and timbral+temporal models achieve the following (average) accuracies: 59.65% and 56.89%, respectively, both with an SVM (≈3500) classifier. Both models outperform the MFCCs baseline: 53.44% (t-test: p-value<0.05), but these random CNNs perform worse than a CNN pre-trained with the Million Song Dataset: 82.1% [28]. Finally, note that although timbral and timbral+temporal models are single-layered, they achieve remarkable performances, showing that single-layered spectrogram front-ends (attending to musically relevant contexts) can do a reasonable job without paying the cost of going deep [36,39]. This means, e.g., that the saved capacity can now be employed by the back-end to learn (some more) representations.
The sample-level waveform model always performs better than frame-level many-shapes (t-test: p-value 0.05). The two best performing spectrogram-based models are temporal and timbral+temporal, but the temporal model performs better than timbral+temporal in all cases (t-test: p-value 0.05), denoting that spectral cues can be a confounding factor for this dataset. Moreover, the 7x86 model performs better than 7x96 in all cases (t-test: p-value<0.05).
The best (average) accuracy score is obtained using temporal models and SVMs (≈3500): 89.82%. Note that the temporal model clearly outperforms the MFCCs baseline: 63.25% (t-test: p-value 0.05) and, interestingly, it performs only slightly worse than a trained CNN: 93.7% [20]. This result confirms that the architectures (alone) introduce a strong prior which can significantly affect the performance of an audio model. This means that, besides learning, designing effective architectures might be key for resolving (music) audio tasks with deep learning. In line with that, note that the temporal architecture is designed considering musical domain knowledge, in this case: how tempo & rhythm are expressed in spectrograms. Hence, its good performance also validates the design strategy of using musically motivated architectures as a way to intuitively navigate through the network parameter space [36,38,39].

Urban Sound 8K: acoustic event detection
For these experiments we do not use the temporal model (with 1D-CNNs of length 165, 128, 64, 32). Instead, we study the timbral+time model, where time follows the same design as temporal but with filters of length: 64, 32, 16, 8. This change is motivated by the fact that temporal cues in (natural) sounds are shorter and less important than temporal cues in music (i.e., rhythm or tempo).
The sample-level waveform model always performs better than frame-level many-shapes (t-test: p-value 0.05). The two best performing spectrogram-based models are VGG and timbral+time, but VGG performs better than timbral+time in all cases (t-test: p-value 0.05). Also, the 7x86 model performs better than 7x96 in all cases (t-test: p-value<0.075). The best (average) accuracy score is obtained using VGG and SVMs (≈3500): 70.74%, outperforming the MFCCs baseline: 65.49% (t-test: p-value<0.05), and performing slightly worse than a trained CNN: 73%¹³ [28]. Finally, note that VGGs achieved remarkable results when recognizing genres and detecting acoustic events, tasks where timbre is an important cue. As a result, one could argue that VGGs are good at representing spectral features. Hence, these might be useful for tasks where spectral cues are relevant.

CONCLUSIONS
This study builds on top of prior works showing that the (classification) performance delivered by random CNN features is correlated with the results of their end-to-end trained counterparts [41,43]. We use this property to run a comprehensive evaluation of current deep architectures for (music) audio. Our method is as follows: first, we extract a feature vector from the embeddings of a randomly weighted CNN; and then, we input these features to a classifier, which can be an SVM or an ELM. Our goal is to compare the obtained classification accuracies when using different CNN architectures. The results we obtain are far from random, since: (i) randomly weighted CNNs are (in some cases) close to matching the accuracies obtained by trained CNNs; and (ii) these are able to outperform MFCCs. This result denotes that the architectures alone are an important piece of the deep learning solution and therefore, searching for efficient architectures capable of encoding the specificities of (music) audio signals might help advance the state of our field. In line with that, we have shown that (musical) priors embedded in the structure of the model can facilitate capturing useful (temporal) cues for classifying rhythm/tempo classes. Besides, we show that for waveform front-ends: sample-level frame-level many-shapes > frame-level, as noted in the (trained) literature [27,52,53]. The differential aspect of the sample-level front-end is that its representational power is constructed by hierarchically combining small-context representations, not by exploiting prior knowledge about waveforms. Further, we show that for spectrogram front-ends: 7x96<7x86, as shown in prior (trained) works [31,36]. By allowing the filters to convolve through the frequency axis, the architecture itself facilitates capturing pitch-invariant timbral representations.
Finally: timbral (+temporal/time) and VGG spectrogram front-ends achieve remarkable results for tasks where timbre is important, as previously noted in the (trained) literature [39]. Their respective advantages are that: (i) timbral (+temporal/time) architectures are single-layered front-ends which explicitly capture acoustically relevant receptive fields, which can be known by exploiting prior knowledge about the task; and (ii) VGG front-ends require no prior domain knowledge about the task for their design. Although our main conclusions are backed by additional results in the (trained) literature, we leave for future work consolidating those by doing a similar study considering trained models.

ACKNOWLEDGMENTS
This work was partially supported by the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502), and we are grateful for the GPUs donated by NVIDIA.