Author(s): Ting-Hsiang Huang | Jane Yung-Jen Hsu | Bo-Jyun Lu
Journal: International Journal of Electronic Business Management
ISSN 1728-2047
Volume: 5
Issue: 1
Start page: 11
Date: 2007
Keywords: Multi-Agent System | Photo Sharing | Metadata | Social Network | Recommendation
ABSTRACT
Visual mouth information has been shown to aid the understanding of speech content. Synthesizing the corresponding face video directly from speech signals is in strong demand because it can significantly reduce the amount of video information that must be transmitted. In this paper, we present a novel statistical learning approach that learns the mapping from input voice signals to the corresponding mouth images. A deformable mouth template model parameterizes the mouth shape corresponding to different transient speech signals, and a radial basis function (RBF) interpolation technique then synthesizes the mouth image from a new set of predicted mouth shape parameters. Support vector regression (SVR) is used to learn the mapping from speech features to visemes, which are parameterized by a set of mouth shape parameters. From the input speech signals, we dynamically predict the mouth shapes with the trained SVRs and then synthesize realistic mouth images. Experimental results demonstrate the speech-driven mouth image synthesis achieved by the proposed algorithm.
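The RBF interpolation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian kernel, the `sigma` width, the small `ridge` term, the two-parameter mouth shape (width, height), and the single pixel intensity per shape are all assumptions chosen for illustration. The SVR stage that would predict the shape parameters from speech features is not shown.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Pairwise Gaussian RBF values between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(centers, values, sigma=1.0, ridge=1e-10):
    """Solve for weights so the interpolant reproduces `values` at `centers`.

    `ridge` is a tiny stabilizer added to the diagonal (an assumption here).
    """
    phi = gaussian_kernel(centers, centers, sigma)
    return np.linalg.solve(phi + ridge * np.eye(len(centers)), values)

def eval_rbf(queries, centers, weights, sigma=1.0):
    """Evaluate the fitted interpolant at new parameter vectors."""
    return gaussian_kernel(queries, centers, sigma) @ weights

# Hypothetical training data: four mouth-shape parameter vectors (width, height)
# and the intensity of one image pixel observed for each shape.
shapes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pixels = np.array([[10.0], [20.0], [30.0], [40.0]])

weights = fit_rbf(shapes, pixels)

# Interpolate the pixel for a new (predicted) set of shape parameters.
predicted = eval_rbf(np.array([[0.5, 0.5]]), shapes, weights)
```

In a full pipeline under these assumptions, one such interpolant would be fitted per pixel (or per image patch), and the query vector would come from the regressor that maps speech features to mouth shape parameters.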