Articulatory-based Audiovisual Speech Synthesis: Proof of Concept for European Portuguese
by Samuel Silva, António Teixeira, Verónica Orvalho
Audiovisual speech synthesis, i.e., the synthesis of both the auditory and visual modalities of speech, offers several advantages over audio-only speech synthesis, e.g., greater robustness through improved perception in noisy environments, and a more natural interface between humans and machines.
While different approaches to audiovisual speech synthesis exist, current research seems to favor data-driven methods enabling, e.g., concatenative synthesis. Although these approaches provide high quality, they depend on the acquisition of speaker data, which can be complex and time-consuming.
The authors argue that an articulatory-based approach to audiovisual speech synthesis might provide a conceptually simple solution that is more versatile than relying on prerecorded speaker data. Furthermore, it could also serve as a research tool for studying the different aspects of audiovisual synthesis, with a focus on understanding speech production.
This article presents first results for an articulatory-based audiovisual speech synthesizer for European Portuguese that uses a computational model of articulatory phonology to drive the animation of a 3D avatar.
Illustrative videos:
Approach 1: "O papá está no trabalho"
Approach 1: "Está no papo do papá"
Approach 2: "O papá está no trabalho"
Approach 2: "A pipoca está no papo"