Content matching for sound generating objects within a visual scene using a computer vision approach

Dan Turner*, Chris Pike, Damian Murphy

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review


The growing demand for immersive audio content production and consumption, particularly in VR, is driving the need for tools to facilitate its creation. Immersive productions place additional demands on sound design teams, specifically the increased complexity of scenes, the larger number of sound-producing objects, and the need to spatialise sound in 360°. This paper presents an initial feasibility study of a methodology that uses visual object detection to detect, track, and match content for sound-generating objects, in this case within a simple 2D visual scene. Results show that, while the approach succeeds for a single moving object, limitations in the current computer vision system cause complications for scenes with multiple objects. Results also show that the recommendation of candidate sound effect files depends heavily on the accuracy of the visual object detection system and on the labelling of the audio repository used.
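The abstract does not describe how detected objects are matched to sound effect files. As a minimal sketch, assuming each detection yields a text label with a confidence score and each repository file carries keyword tags, candidates could be ranked by the Jaccard similarity between the label's words and the file's tags. All names (`Detection`, `SoundFile`, `recommend`, `min_confidence`), the threshold, and the example files are hypothetical illustrations, not the authors' method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Detection:
    label: str          # hypothetical: class label from a visual object detector, e.g. "dog"
    confidence: float   # hypothetical: detector confidence in [0, 1]


@dataclass
class SoundFile:
    filename: str
    tags: set[str]      # keywords attached to the file in the audio repository


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(detections: list[Detection], library: list[SoundFile],
              min_confidence: float = 0.5, top_k: int = 3) -> dict[str, list[str]]:
    """Return up to top_k candidate filenames per confidently detected label."""
    candidates: dict[str, list[str]] = {}
    for det in detections:
        if det.confidence < min_confidence:
            continue  # unreliable detections produce no recommendations at all
        label_tokens = set(det.label.lower().split())
        scored = [(jaccard(label_tokens, {t.lower() for t in sf.tags}), sf.filename)
                  for sf in library]
        # keep files sharing at least one keyword; best score first, ties broken by name
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        candidates[det.label] = [name for _, name in ranked[:top_k]]
    return candidates


library = [
    SoundFile("dog_bark_01.wav", {"dog", "bark"}),
    SoundFile("dog_pant_small.wav", {"dog", "panting", "small"}),
    SoundFile("car_pass_by.wav", {"car", "pass", "by", "traffic"}),
    SoundFile("ambience_park.wav", {"park", "birds", "ambience"}),
]
detections = [Detection("dog", 0.92), Detection("car", 0.81), Detection("bird", 0.30)]
print(recommend(detections, library))
# {'dog': ['dog_bark_01.wav', 'dog_pant_small.wav'], 'car': ['car_pass_by.wav']}
```

Even this toy version shows both dependencies the abstract reports: the low-confidence "bird" detection is dropped entirely, and a confident `Detection("bird", 0.95)` would still return an empty list because the repository tags the ambience file with "birds" rather than "bird".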

Original language: English
Number of pages: 10
Publication status: Published - 28 May 2020
Event: 148th Audio Engineering Society International Convention 2020 - Vienna, Virtual, Online, Austria
Duration: 2 Jun 2020 to 5 Jun 2020


Conference: 148th Audio Engineering Society International Convention 2020
City: Vienna, Virtual, Online

Bibliographical note

Funding Information:
This project is supported by an EPSRC iCASE PhD Studentship in partnership with BBC R&D.

Publisher Copyright:
© 2020 148th Audio Engineering Society International Convention. All rights reserved.
