Speech extraction with RGB-intensity gradient on rolling-shutter video


Abstract

Recent studies have proposed methods for extracting speech from video of objects that vibrate in response to sound waves. Among them, methods that use video captured by cameras with a rolling shutter, which is widely used in consumer digital single-lens reflex cameras, have attracted attention because of their low equipment cost. The conventional rolling-shutter method processes grayscale video based on phase images. However, grayscale video has a smaller dynamic range than RGB video, which degrades the speech extraction accuracy of the conventional method. This paper therefore proposes a speech extraction method based on RGB-intensity gradients computed on RGB video. The proposed method extracts speech by calculating the similarity of the R, G, and B intensity gradients; using all three gradients expands the dynamic range. Experimental results on the quality and intelligibility of the extracted speech show that the proposed method outperforms the conventional method.