Voicing Silent Speech

Gaddy, David

Description

This thesis concerns the task of turning silently mouthed words into audible speech. By using sensors that measure electrical signals from muscle movement (electromyography or EMG), it is possible to capture articulatory information from the face and neck that pertains to speech. Using these signals, we aim to train a machine learning model to generate audio in the original speaker's voice that corresponds to words that were silently mouthed. We call this task voicing silent speech.

Voicing silent speech has a wide array of potential real-world applications. For example, it could allow phone or video conversations in which people near the speaker cannot hear anything they say, or it could be useful in some clinical applications for people who cannot speak normally but still have use of most of their speech articulators.

Several prior papers have studied the problem of converting EMG signals to speech. However, these prior EMG-to-speech works focused on the artificial task of recovering audio from EMG that was recorded during normal vocalized speech. In this work, we instead generate speech from recordings where no actual sound was produced. Models trained only on vocalized speech perform poorly when applied to silent speech because the signals differ between the two modes. Our work is the first to train a model on EMG from silent speech, allowing us to overcome these signal differences.

Training with EMG from silent speech is more challenging than training with EMG from vocalized speech: vocalized EMG data comes with time-aligned speech targets, whereas silent EMG data has no simultaneous audio. Our solution is a target-transfer approach, in which audio output targets are transferred from vocalized recordings to silent recordings of the same utterances. Because the two recordings are not time-aligned, this cross-modal training requires an alignment step, so a core component of our work concerns finding the best way to align the vocalized speech targets with the silent utterances.
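The target-transfer idea can be illustrated with a small sketch. The abstract does not name the alignment method, so the example below assumes dynamic time warping (DTW), a standard technique for aligning two sequences of different lengths. It is not presented as the thesis's actual procedure. The function names `dtw_path` and `transfer_targets`, the Euclidean frame distance, and the toy one-dimensional features are all hypothetical choices made for illustration.

```python
from math import inf


def dtw_path(silent, vocal):
    """Align two sequences of feature vectors with dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching silent[i] to vocal[j]. Assumption: Euclidean frame distance.
    """
    n, m = len(silent), len(vocal)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # cost[i][j] = best cumulative cost aligning silent[:i] with vocal[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(silent[i - 1], vocal[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def transfer_targets(silent_feats, vocal_feats, vocal_targets):
    """Give each silent frame the target of the first vocal frame aligned to it."""
    _, path = dtw_path(silent_feats, vocal_feats)
    targets = [None] * len(silent_feats)
    for i, j in path:
        if targets[i] is None:
            targets[i] = vocal_targets[j]
    return targets


# Toy data: the vocalized recording is longer than the silent one.
silent = [[0.0], [1.0], [2.0]]
vocal = [[0.0], [0.0], [1.0], [2.0], [2.0]]
audio_targets = ["a", "b", "c", "d", "e"]  # stand-ins for per-frame audio features
print(transfer_targets(silent, vocal, audio_targets))
```

In this sketch each silent frame receives one audio target taken from the vocalized recording, which yields a time-aligned training pair even though no audio was recorded during silent speech. Which features to compare during alignment is exactly the kind of design question the paragraph above describes as central.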

To enable development on this task, we collect and release a dataset of nearly twenty hours of EMG speech recordings, nearly ten times larger than previous publicly available datasets. We then demonstrate a method for training a speech synthesis model on silent EMG and propose a range of other modeling improvements to make the synthesized outputs more intelligible. We validate our methods with both human and automatic metrics, demonstrating major improvements in intelligibility of generated outputs.

Details

Title

Voicing Silent Speech

Creator

Gaddy, David, Author

Published

EECS Department, University of California at Berkeley, Berkeley, California, 5/11/2022

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Type

Text

Format

technical reports

Extent

74 p

Language

eng

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
