StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion


Abstract

The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference time by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity while sampling 10-20× faster, making it an attractive alternative for efficient large-scale zero-shot TTS systems.
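
The abstract mentions classifier-free guidance in the style diffusion process. As a rough illustration only, and assuming the commonly used formulation in which the guided prediction extrapolates from the unconditional output toward the conditional one, the combination step might look like the sketch below. The function name `cfg_combine`, the `scale` parameter, and the toy vectors are hypothetical and are not taken from the StyleTTS-ZS implementation.

```python
def cfg_combine(cond, uncond, scale):
    """Blend conditional and unconditional predictions (assumed standard form).

    guided = uncond + scale * (cond - uncond)
    scale = 0 ignores the reference, scale = 1 is the plain conditional
    prediction, and scale > 1 pushes further toward the reference condition.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy style-code predictions (hypothetical values, not model outputs).
with_reference = [1.0, 2.0]
without_reference = [0.0, 0.0]
guided = cfg_combine(with_reference, without_reference, scale=2.0)
```

In a sketch like this, a larger `scale` trades diversity for closeness to the reference speaker; the value used by the authors is not stated on this page.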

| Prompt | StyleTTS-ZS (LibriLight) | VoiceCraft (GigaSpeech + LibriLight) |
| --- | --- | --- |
| Mark Zuckerberg | SIM: 0.68 (time taken: 1.57s) | SIM: 0.71 (time taken: 14.26s) |
| Jeff Bezos | SIM: 0.60 (time taken: 2.21s) | SIM: 0.62 (time taken: 22.49s) |
| Kamala Harris | SIM: 0.63 (time taken: 1.34s) | SIM: 0.57 (time taken: 14.13s) |
| Benedict Cumberbatch | SIM: 0.61 (time taken: 1.44s) | SIM: 0.59 (time taken: 11.26s) |
| Sam Altman | SIM: 0.75 (time taken: 1.45s) | SIM: 0.75 (time taken: 11.06s) |
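
The timings listed above can be turned into per-prompt speedup ratios with a few lines of arithmetic. The sketch below copies the reported times as they appear on this page; the variable and function names are illustrative, and the ratios describe only these five samples, not a controlled benchmark.

```python
# (StyleTTS-ZS seconds, VoiceCraft seconds), copied from the comparison above.
timings = {
    "Mark Zuckerberg": (1.57, 14.26),
    "Jeff Bezos": (2.21, 22.49),
    "Kamala Harris": (1.34, 14.13),
    "Benedict Cumberbatch": (1.44, 11.26),
    "Sam Altman": (1.45, 11.06),
}


def speedups(rows):
    """Return VoiceCraft time divided by StyleTTS-ZS time for each prompt."""
    return {name: other / ours for name, (ours, other) in rows.items()}


ratios = speedups(timings)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x faster")
```

For the five prompts listed, the ratios fall roughly between 7.6× and 10.5×.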

Comparison with Small-Scale Models

This section contains audio samples from our model trained on LibriTTS (LT) and LibriLight (LL) and from all public models in Table 2. All speakers are unseen during training, and the baseline samples were obtained from publicly available checkpoints.

Text

Your play must be not merely a good play, but a successful one.

Whatever appealed to her sense of beauty was straightway transferred to paper or canvas.

But the general distinction is not on that account to be overlooked.

Federal judges and United States attorneys in Utah, who were not Mormons nor lovers of Mormonism, refused to entertain complaints or prosecute cases under the law because of its manifest injustice and inadequacy.

Why fades the lotus of the water?

"Nine thousand years have elapsed since she founded yours, and eight thousand since she founded ours," as our annals record.

The two young men, who were by this time full of the adventure, went down to the Wall Street office of Henry's uncle and had a talk with that wily operator.

"What is your country, Olaf? Have you always been a thrall?" The thrall's eyes flashed.

Prompt
StyleTTS-ZS (LT)
StyleTTS-ZS (LL)
HierSpeech++
StyleTTS 2
XTTS-v2
Ground Truth

* please scroll horizontally to see more samples (8 samples total).

Comparison with Large-Scale Models I

This section contains audio samples from our model trained on LibriLight (LL) and from some of the models in Table 1. Since all samples (except VoiceCraft) are obtained from the public demo page of each corresponding model, we divide this section into three parts for a better layout. For the source of each model, please refer to Appendix C.1.

Text

For a few miles, she followed the line hitherto presumably occupied by the coast of Algeria, but no land appeared to the south.

I had always known him to be restless in his manner, but on this particular occasion he was in such a state of uncontrollable agitation that it was clear something very unusual had occurred.

His death in this conjuncture was a public misfortune.

It is this that is of interest to theory of knowledge.

Indeed, there were only one or two strangers who could be admitted among the sisters without producing the same result.

Their piety would be like their names, like their faces, like their clothes, and it was idle for him to tell himself that their humble and contrite hearts it might be paid a far-richer tribute of devotion than his had ever been. A gift tenfold more acceptable than his elaborate adoration.

The air and the earth are curiously mated and intermingled as if the one were the breath of the other.

For if he's anywhere on the farm, we can send for him in a minute.

Prompt
StyleTTS-ZS (LL)
NaturalSpeech 2
NaturalSpeech 3
FlashSpeech
CLaM-TTS
DiTTo-TTS
VoiceCraft
Ground Truth

Comparison with Large-Scale Models II

This section contains audio samples from our model trained on LibriLight (LL) and from some of the models in Table 1. Since all samples (except VoiceCraft) are obtained from the public demo page of each corresponding model, we divide this section into three parts for a better layout. For the source of each model, please refer to Appendix C.1.

Text

The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three wire system and it gave an immediate impetus to incandescent lighting.

For, like as not, they must have thought him a prince when they saw his fine cap.

What you had best do, my child, is to keep it and pray to it that since it was a witness to your undoing, it will deign to vindicate your cause by its righteous judgment.

It is this that is of interest to theory of knowledge.

Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Instead of shoes, the old man wore boots with turnover tops, and his blue coat had wide cuffs of gold braid.

The army found the people in poverty, and left them in comparative wealth.

He was in deep converse with the clerk and entered the hall holding him by the arm.

Prompt
StyleTTS-ZS (LL)
NaturalSpeech 2
NaturalSpeech 3
FlashSpeech
MegaTTS 2
VoiceBox
CLaM-TTS
DiTTo-TTS
Vall-E
Ground Truth

Comparison with VoiceBox

This section contains audio samples from our model trained on LibriLight (LL) and VoiceBox samples from its official demo page.

Text

And the whole night the tree stood still and in deep thought

And lay me down in thy cold bed and leave my shining lot

Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid

The army found the people in poverty and left them in comparative wealth

He was in deep converse with the clerk and entered the hall holding him by the arm

Number ten fresh nelly is waiting on you good night husband

How much wood could a woodchuck chuck if a woodchuck could chuck wood

When feline magicians enchant the city and crafty canine illusionists work to restore balance, don’t miss the uproarious clash in ‘magic and mischief: the paws of mystery.’

Prompt
StyleTTS-ZS (LL)
VoiceBox

Speech Editing

In this section, we compare our model's speech editing ability with that of other models.

The following audio samples are obtained from VoiceCraft's demo page.

Text: Will find himself completely at a loss on occasions of common and constant recurrence rare and unpredictable circumstances, speculative ability is one thing and practical ability is another.

Original StyleTTS-ZS VoiceBox VoiceCraft

Text: And especially as I am not very much up in latin myself, he said, the suit was on an insurance policy a classified treasure map that he was defending on the ground of misinterpretations.

Original StyleTTS-ZS VoiceBox VoiceCraft

Text: In zero weather, in mid-winter, when the earth is frozen to a great depth below the surface jack frost has cast his icy spell upon the land, when in driving over the unpaved country roads, they give forth a hard metallic ring.

Original StyleTTS-ZS VoiceBox VoiceCraft

Text: This was george steers the son of a british naval captain and ship modeler who had become an american naval officer and was the first man to take charge of the washington navy yard entrusted with the prestigious role of overseeing the operations at the renowned naval headquarters.

Original StyleTTS-ZS VoiceBox VoiceCraft

Voice Conversion

The following audio samples are obtained from NaturalSpeech 2's demo page.

Text: The children the most blood-curdling ideas, to hate God, for instance.

Prompt Source StyleTTS-ZS NaturalSpeech 2

Text: Oh, Bartley, that old about my being hard!

Prompt Source StyleTTS-ZS NaturalSpeech 2