A real-time music detection method based on convolutional neural network using Mel-spectrogram and spectral flux



Audio processing, such as speech enhancement, improves speech intelligibility and quality in real-time communication (RTC) scenarios such as online meetings and online education. However, such processing, primarily noise suppression and automatic gain control, degrades music quality when the captured signal is music instead of speech. A music detector can resolve this issue by switching off the speech processing when music is detected. In RTC scenarios, the music detector should have low complexity and cover diverse situations, including different types of music, background noises, and acoustic environments. In this paper, a real-time music detection method with low computational complexity is proposed, based on a convolutional neural network (CNN) that uses the Mel-spectrogram and spectral flux as input features. The proposed method achieves an overall accuracy of 90.63% across different music types (classical music, instrumental solos, sung songs, etc.), speech languages (English and Mandarin), and noise types. The proposed method is built on a lightweight CNN model with a small feature size, which enables real-time processing.
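The two input features named above, the Mel-spectrogram and spectral flux, can be sketched with a generic NumPy implementation. This is an illustrative sketch, not the paper's configuration: the sample rate, FFT size, hop length, and number of Mel bands below are assumed values, and the flux definition used here (L2 norm of the positive spectral change between consecutive frames) is one common variant.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def features(x, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take magnitude spectra
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (frames, bins)

    # Log-Mel spectrogram: project magnitudes onto the Mel filterbank
    mel = np.log(mel_filterbank(sr, n_fft, n_mels) @ mag.T + 1e-10).T

    # Spectral flux: positive spectral change between consecutive frames
    diff = np.diff(mag, axis=0)
    flux = np.sqrt(np.sum(np.maximum(diff, 0.0) ** 2, axis=1))
    flux = np.concatenate([[0.0], flux])             # align with frames
    return mel, flux
```

Stacking the per-frame log-Mel vectors with the scalar flux value yields a compact 2-D feature map suitable as input to a small CNN; spectral flux is cheap to compute and tends to differ between speech (strongly modulated) and sustained musical tones.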