Description

Articulatory synthesis, the process of generating speech from the physical movements of human articulators, offers unique advantages due to its physically grounded and compact input features. However, recent advances in the field have prioritized audio quality without attending to streaming latency. In this paper, we propose a real-time streaming differentiable digital signal processing (DDSP) articulatory vocoder that synthesizes speech from electromagnetic articulography (EMA), fundamental frequency (F0), and loudness data. Our best model achieves a transcription word error rate (WER) of 8.9%, 4.0 points lower than a state-of-the-art baseline. The same model can also generate 5 milliseconds of speech in less than 2 milliseconds on CPU in a streaming fashion, opening the door to downstream real-time, low-latency audio applications.
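The latency claim can be expressed as a real-time factor (RTF): compute time divided by the duration of audio produced, where values below 1.0 mean faster-than-real-time synthesis. A minimal sketch of that arithmetic, using only the 5 ms / 2 ms figures stated in the abstract (the function name is illustrative, not from the paper):

```python
def real_time_factor(compute_ms: float, audio_ms: float) -> float:
    """Compute time divided by audio duration; < 1.0 means faster than real time."""
    return compute_ms / audio_ms

# Abstract's claim: a 5 ms chunk of speech is synthesized in under 2 ms on CPU.
rtf = real_time_factor(compute_ms=2.0, audio_ms=5.0)
print(rtf)  # 0.4 -> the vocoder runs at 0.4x real time, leaving headroom
```

An RTF of 0.4 leaves over half of each chunk's duration free for other processing in a streaming pipeline, which is what enables the downstream low-latency applications mentioned above.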
