Description

As edge-device applications increasingly interact with users through speech, efficient automatic speech synthesis is becoming more important. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into raw audio waveforms. Most existing vocoders are difficult to parallelize because each generated sample is conditioned on previous samples. Flow-based feed-forward models such as WaveGlow are an alternative to these auto-regressive models. However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This work presents SqueezeWave, an extremely lightweight vocoder that can generate audio of similar quality to WaveGlow with 61x to 214x fewer multiply-accumulate operations (MACs).
