Abstract
Recent studies have proposed methods for extracting speech from video of objects vibrated by
sound waves. Among them, methods that use video captured with a rolling shutter, which is widely
used in consumer digital single-lens reflex cameras, have attracted attention because of their low
equipment cost. The conventional rolling-shutter method applies phase-image-based processing to
a grayscale video. However, a grayscale video has a smaller dynamic range than an RGB video,
which degrades the speech extraction accuracy of the conventional method. This paper therefore
proposes a speech extraction method based on RGB intensity gradients in an RGB video to improve
extraction accuracy. The proposed method extracts speech by calculating the similarity of the R, G,
and B intensity gradients; using all three gradients expands the dynamic range. Experimental
results on the quality and intelligibility of the extracted speech show that the proposed method
outperforms the conventional method.