Masked Conditional Neural Networks for sound classification

Research output: Contribution to journal › Article › peer-review


Publication details

Date accepted/in press: 9 Dec 2019
Date e-pub ahead of print: 17 Jan 2020
Date published (current): May 2020
Number of pages: 17
Early online date: 17 Jan 2020
Original language: English


The remarkable success of deep convolutional neural networks in image-related applications has led to their adoption for sound processing as well. Typically the input is a time–frequency representation such as a spectrogram, and in some cases this is treated as a two-dimensional image. However, the properties of a spectrogram are very different from those of a natural image. Whereas an object occupies a contiguous region of a natural image, the frequencies of a sound are scattered along the frequency axis of a spectrogram in a pattern unique to that particular sound. Applying conventional convolutional neural networks has therefore required extensive hand-tuning, and has created the need for an architecture better suited to the time–frequency properties of audio. We introduce the ConditionaL Neural Network (CLNN) and its extension, the Masked ConditionaL Neural Network (MCLNN), both designed to exploit the nature of sound in a time–frequency representation. The CLNN is, broadly speaking, linear across frequencies but non-linear across time: it conditions its inference at a particular time on the preceding and succeeding time slices. The MCLNN adds a controlled, systematic sparseness that embeds a filterbank-like behavior within the network. Additionally, the MCLNN automates the concurrent exploration of several feature combinations, analogous to hand-crafting the optimum combination of features for a recognition task. We have applied the MCLNN to music genre classification and environmental sound recognition, using several music datasets (Ballroom, GTZAN, ISMIR2004, and Homburg) and environmental sound datasets (Urbansound8K, ESC-10, and ESC-50). The classification accuracy of the MCLNN surpasses that of neural-network-based architectures, including state-of-the-art convolutional neural networks, and of several hand-crafted attempts.
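
The abstract gives no equations, so the following is only a minimal sketch of one way to read its description, not the authors' implementation. It assumes that "conditioning on preceding and succeeding time slices" means summing a separate weight matrix applied to each frame in a window of `order` frames on either side, and that the "systematic sparseness" is a fixed binary band mask multiplied into each weight matrix so that every hidden unit sees only a contiguous band of frequency bins. The names `band_mask` and `masked_conditional_layer`, the parameters `order` and `bandwidth`, the even spacing of band centers, and the ReLU activation are all hypothetical choices made for illustration.

```python
import numpy as np

def band_mask(n_in, n_out, bandwidth):
    """Binary (n_in, n_out) mask: each output unit sees one band of input bins.

    Assumption: band centers are spread evenly over the frequency axis.
    """
    centers = np.linspace(0, n_in - 1, n_out)
    rows = np.arange(n_in)[:, None]
    return (np.abs(rows - centers[None, :]) <= bandwidth / 2).astype(float)

def masked_conditional_layer(x, weights, bias, mask, order):
    """Apply a masked, temporally conditioned layer to a spectrogram segment.

    x:       (frames, n_in) spectrogram frames
    weights: (2 * order + 1, n_in, n_out), one matrix per frame offset
    mask:    (n_in, n_out) binary mask shared by all offsets
    Returns (frames - 2 * order, n_out): one output per frame that has
    `order` neighbours on each side.
    """
    n_valid = x.shape[0] - 2 * order
    out = np.tile(bias, (n_valid, 1)).astype(float)
    for k in range(2 * order + 1):
        # Offset u = k - order; x[k:k + n_valid] lines up frame t + u
        # with the output for central frame t.
        out += x[k:k + n_valid] @ (weights[k] * mask)
    return np.maximum(out, 0.0)  # ReLU, an assumed activation

# Toy usage with random data (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
order = 2
x = rng.standard_normal((10, 8))             # 10 frames, 8 frequency bins
w = rng.standard_normal((2 * order + 1, 8, 4))
b = np.zeros(4)
mask = band_mask(8, 4, bandwidth=3)
y = masked_conditional_layer(x, w, b, mask, order)
```

With these toy shapes, `y` has one row for each of the 6 frames that have two neighbours on both sides, and each of the 4 output units depends only on the frequency bins inside its band, which is the filterbank-like behavior the abstract refers to.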

Bibliographical note

© 2020 The Authors

Research areas

Restricted Boltzmann Machine; RBM; Conditional Restricted Boltzmann Machine; CRBM; Music Information Retrieval; MIR; Environmental Sound Recognition; ESR; Conditional Neural Networks; CLNN; Masked Conditional Neural Networks; MCLNN; Deep Neural Networks
