Speech extraction with RGB-intensity gradient on rolling-shutter video
Time: 8:20 am
Author: Tsubasa Yoshizawa
Abstract ID: 1753
Several recent studies have proposed extracting speech from video of objects that vibrate in response to sound waves. Among these, methods that use video captured by rolling-shutter cameras, which are widely used in consumer digital single-lens reflex cameras, have been attracting attention because of their low equipment cost. The conventional method for rolling-shutter video processes a grayscale video based on phase images. However, a grayscale video has a smaller dynamic range than an RGB video, and thus the speech extraction accuracy of the conventional method degrades. Therefore, this paper proposes a speech extraction method based on RGB-intensity gradients of an RGB video to improve speech extraction accuracy. The proposed method extracts the speech by calculating the similarity of the R, G, and B intensity gradients; using these three intensity gradients expands the dynamic range. Experimental results on the quality and intelligibility of the extracted speech show that the proposed method outperforms the conventional method.
ConvTasNet-based anomalous noise separation for intelligent noise monitoring
Time: 11:00 am
Author: Han Li
Abstract ID: 2035
Noise pollution has become a growing concern in public health. The availability of low-cost wireless acoustic sensor networks permits continuous monitoring of noise. However, real acoustic scenes contain irrelevant sources (anomalous noise) that overlap with the monitored noise, causing biased evaluation and controversy. We select one classical scene for our study: in road traffic noise assessment, non-traffic noise (e.g., speech, thunder) should be excluded to obtain a reliable evaluation. Because anomalous noise is diverse, occasional, and unpredictable in real-life scenes, removing it from the mixture is a challenge. We explore a fully convolutional time-domain audio separation network (ConvTasNet) for arbitrary sound separation. ConvTasNet is trained on a large dataset of over 150 hours, including environmental sounds, speech, and music. After training, the scale-invariant signal-to-distortion ratio (SI-SDR) is improved by 11.40 dB on average on an independent test dataset. ConvTasNet is then applied to anomalous noise separation in traffic noise scenes. We mix traffic noise and anomalous noise at a random SNR between -10 dB and 0 dB. Separation is especially effective for salient and long-term anomalous noise, and removing such noise smooths the overall sound pressure level curve over time. The results emphasize the importance of anomalous noise separation for reliable evaluation.
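The two quantities named in the abstract above, SI-SDR and mixing at a target SNR, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `si_sdr` and `mix_at_snr` are hypothetical, the SI-SDR form is the commonly used projection-based definition, and which source plays the role of "signal" in the SNR (traffic noise or anomalous noise) is an assumption left to the caller.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between a 1-D estimate and its reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))

def mix_at_snr(signal, interferer, snr_db):
    """Scale `interferer` so the signal-to-interferer power ratio is snr_db, then add.

    Returns the mixture and the gain applied to the interferer.
    """
    p_sig = np.mean(signal**2)
    p_int = np.mean(interferer**2)
    gain = np.sqrt(p_sig / (p_int * 10.0 ** (snr_db / 10.0)))
    return signal + gain * interferer, gain

# Usage: draw a random SNR in [-10, 0] dB and build one mixture.
rng = np.random.default_rng(0)
traffic = rng.standard_normal(16000)    # stand-in for a traffic noise clip
anomalous = rng.standard_normal(16000)  # stand-in for an anomalous noise clip
snr_db = rng.uniform(-10.0, 0.0)
mixture, gain = mix_at_snr(traffic, anomalous, snr_db)
```

An SI-SDR improvement such as the one reported would then be the difference between `si_sdr(separated, reference)` and `si_sdr(mixture, reference)`, averaged over a test set.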
CNN-based multi-class multi-label classification of sound scenes in the context of wind turbine sound emission measurements
Time: 8:00 am
Author: Nils Poschadel
Abstract ID: 2205
Within the scope of the interdisciplinary project WEA-Akzeptanz, measurements of the sound emission of wind turbines were carried out at the Leibniz University Hannover. Due to the environment, the recorded signals contain interfering components (e.g., traffic, birdsong, wind, rain). Depending on the subsequent signal processing and analysis, it may be necessary to identify sections with the raw sound of a wind turbine, recordings with the purest possible background noise, or even a specific combination of interfering noises. Due to the amount of data, a manual classification of the audio signals is usually not feasible, and an automated classification becomes necessary. In this paper, we extend our previously proposed multi-class single-label classification model to a multi-class multi-label model, which reflects the real-world acoustic conditions around wind turbines more accurately and allows for finer-grained evaluations. We first provide a short overview of the data acquisition and the dataset. We then briefly summarize our previous approach, extend it to a multi-class multi-label formulation, and analyze the trained convolutional neural network with respect to different metrics. Overall, the model delivers very reliable classification results, with an example-based F1-score of about 80 % for a multi-label classification of 12 classes.
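For readers unfamiliar with the example-based F1-score reported in the abstract above, here is a minimal sketch of how such a score is typically computed for multi-label predictions: F1 is evaluated per recording from its true and predicted label sets, then averaged over recordings. The function name `example_based_f1` is hypothetical, and scoring an example with no true and no predicted labels as 1 is an assumed convention; the paper's exact handling is not stated in the abstract.

```python
import numpy as np

def example_based_f1(y_true, y_pred):
    """Mean per-example F1 for binary label matrices of shape (n_examples, n_classes)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    # Number of labels that are both true and predicted, per example.
    intersection = np.sum(y_true & y_pred, axis=1)
    # Size of the true label set plus size of the predicted label set.
    denom = np.sum(y_true, axis=1) + np.sum(y_pred, axis=1)
    # Assumed convention: an example with empty true and predicted sets scores 1.
    per_example = np.where(denom > 0, 2.0 * intersection / np.maximum(denom, 1), 1.0)
    return float(per_example.mean())

# Usage with three hypothetical classes (e.g., turbine, traffic, birdsong).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
score = example_based_f1(y_true, y_pred)  # first example 2/3, second example 1
```

Unlike micro- or macro-averaged F1, this variant weights every recording equally regardless of how many labels it carries.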
A basic study on estimating location of sound source by using distributed acoustic measurement network
Time: 8:00 am
Author: Itsuki Ikemi
Abstract ID: 2439
The sounds from childcare facilities are often a cause of noise problems with neighbors. However, because the sound power levels of children's play and other sounds in childcare facilities have not been clarified, evaluation methods have not been established, making countermeasures difficult. In order to evaluate the noise, it is necessary to model the location of the sound source and the sound power level. We have been developing a sound source identification system that uses multiple Raspberry Pi-based recording devices to estimate the location of a sound source and its sound power level. By using GPS for time synchronization, the devices can be distributed and placed without connecting cables, which is expected to expand the measurement area significantly. For estimation, the arrival time difference is calculated by cross-correlation of the signals input to each recording device, and the sound source location is estimated from the calculated arrival time differences and the location information of the devices. The effectiveness of this system was verified in an anechoic room and in outdoor fields.
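The arrival-time-difference step described in the abstract above can be sketched as follows: cross-correlate the signals from two time-synchronized recorders and take the lag of the correlation peak. This is an illustrative sketch, not the authors' implementation; the function name `estimate_delay` and the sampling rate are assumptions, and plain `np.correlate` is used for clarity even though an FFT-based correlation would be faster for long recordings.

```python
import numpy as np

def estimate_delay(sig_a, sig_b, fs):
    """Delay of sig_b relative to sig_a in seconds (positive if b arrives later)."""
    # Full cross-correlation; output index 0 corresponds to lag -(len(sig_a) - 1).
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_a) - 1)
    return lag / fs

# Usage: a synthetic source reaching recorder B 25 samples after recorder A.
fs = 8000  # assumed sampling rate in Hz
rng = np.random.default_rng(0)
source = rng.standard_normal(2000)
at_a = source
at_b = np.concatenate([np.zeros(25), source[:-25]])
delay = estimate_delay(at_a, at_b, fs)  # 25 / 8000 s
```

Multiplying the delay by the speed of sound gives the difference in path length to the two devices; combining several such differences with the known device positions constrains the source location.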
Subjective hearing sensation of process variations at a milling machine. How reliable will chatter marks be detected?
Time: 8:40 am
Author: Florian Trautmann
Abstract ID: 2599
Intuition enables experienced machine operators to detect production errors and to identify their specific sources. A prominent example in machining is chatter marks caused by machining vibrations. The operator's assessment of whether the process runs stably is not based exclusively on technical parameters such as rotation frequency, tool diameter, or the number of teeth. Because the human ear is a powerful feature extraction and classification device, this study investigates to what degree the hearing sensation influences the operator's decision-making. A steel machining process with a design of experiments (DOE)-based variation of process parameters was conducted on a milling machine. Microphone and acceleration sensors recorded machining vibrations, and machine operators documented their hearing sensation via a survey sheet. In order to obtain the optimal dataset for calculating various psychoacoustic characteristics, a principal component analysis was conducted. The subsequent correlation analysis of all sensor data and the operator information suggests that psychoacoustic characteristics such as tonality and loudness are very good indicators of the process quality perceived by the operator. The results support the application of psychoacoustic technology for machine and process monitoring.
A real-time music detection method based on convolutional neural network using Mel-spectrogram and spectral flux
Time: 7:40 am
Author: Yiya Hao
Abstract ID: 11599
Audio processing, including speech enhancement systems, improves speech intelligibility and quality in real-time communication (RTC) such as online meetings and online education. However, such processing, primarily noise suppression and automatic gain control, is harmful to music quality when the captured signal is music instead of speech. A music detector can solve this issue by switching off the speech processing when music is detected. In RTC scenarios, the music detector should have low complexity and cover various situations, including different types of music, background noises, and other acoustical environments. In this paper, a real-time music detection method with low computational complexity is proposed, based on a convolutional neural network (CNN) using the Mel-spectrogram and spectral flux as input features. The proposed method achieves an overall accuracy of 90.63% across different music types (classical music, instrument solos, songs with singing, etc.), speech languages (English and Mandarin), and noise types. The proposed method is built on a lightweight CNN model with a small feature size, which guarantees real-time processing.
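Spectral flux, one of the two input features named in the abstract above, measures how much the magnitude spectrum changes from one frame to the next. The sketch below uses one common definition (the Euclidean norm of the frame-to-frame magnitude difference); the abstract does not state which variant, frame length, or hop size the authors use, so the function name `spectral_flux` and all parameter values here are assumptions.

```python
import numpy as np

def spectral_flux(signal, frame_len=512, hop=256):
    """Frame-to-frame spectral flux of a 1-D signal (one value per frame transition)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Euclidean norm of the change in magnitude spectrum between adjacent frames.
    diff = np.diff(mag, axis=0)
    return np.sqrt(np.sum(diff**2, axis=1))

# Usage: a steady tone has almost no flux; a tone followed by noise does.
n = np.arange(4096)
tone = np.sin(2 * np.pi * n / 32)  # period of 32 samples divides the hop size
rng = np.random.default_rng(0)
changing = np.concatenate([tone[:2048], rng.standard_normal(2048)])
flux_steady = spectral_flux(tone)
flux_changing = spectral_flux(changing)
```

A sequence of such flux values could be stacked alongside Mel-spectrogram frames as an additional input channel or feature row, although how the two features are combined in the proposed network is not specified in the abstract.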