We introduce Seed-TTS, a series of large-scale autoregressive text-to-speech (TTS) models that can generate speech almost indistinguishable from human speech.
Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches real human speech in both objective and subjective evaluations.
Through fine-tuning, we achieve even higher subjective scores on these metrics. Seed-TTS exhibits excellent control over various speech attributes, such as emotion, enabling the generation of highly expressive and diverse speech for speakers in the wild.
Additionally, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance the model's robustness, speaker similarity, and controllability.
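The abstract leaves the RL objective unspecified; purely to illustrate the general idea, the sketch below applies a generic REINFORCE-style update to an autoregressive speech-token model. Here `model`, `reward_fn` (e.g. a speaker-similarity scorer), and all hyperparameters are hypothetical stand-ins, not the paper's actual components.

```python
import torch

def reinforce_step(model, text_tokens, reward_fn, optimizer, max_len=256):
    """One REINFORCE-style update for an autoregressive speech-token LM.

    `model` maps a token prefix (B, T) to next-token logits (B, T, V);
    `reward_fn` scores a full sampled sequence, e.g. with a speaker-similarity
    proxy. Both are hypothetical stand-ins for illustration only.
    """
    tokens = text_tokens                              # (B, T0) conditioning prefix
    log_probs = []
    for _ in range(max_len):
        logits = model(tokens)[:, -1, :]              # next-token logits
        dist = torch.distributions.Categorical(logits=logits)
        next_tok = dist.sample()                      # sample one speech token
        log_probs.append(dist.log_prob(next_tok))     # keep log-prob for the gradient
        tokens = torch.cat([tokens, next_tok.unsqueeze(1)], dim=1)

    reward = reward_fn(tokens)                        # (B,) scalar reward per sample
    baseline = reward.mean()                          # simple variance-reduction baseline
    # Policy-gradient loss: increase log-probs of above-average-reward samples.
    seq_log_prob = torch.stack(log_probs, dim=1).sum(dim=1)
    loss = -((reward - baseline).detach() * seq_log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```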
We also introduce a non-autoregressive (NAR) variant of the Seed-TTS model, called Seed-TTS DiT, which employs a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, Seed-TTS DiT does not depend on pre-estimated phoneme durations, instead generating speech through end-to-end processing.
We demonstrate that this variant achieves performance comparable to the language-model-based variant in both objective and subjective evaluations, and we showcase its effectiveness in speech editing.
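As a minimal sketch of what duration-free, end-to-end NAR generation can look like, the toy training step below denoises an entire speech-latent sequence conditioned on text embeddings; the sequence length is set once at the utterance level, so no per-phoneme durations appear anywhere. `TinyDiTDenoiser`, the epsilon-prediction objective, and the linear interpolation noise schedule are all assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiTDenoiser(nn.Module):
    """Toy transformer denoiser over a full speech-latent sequence.

    Illustrative only: the real Seed-TTS DiT architecture is not specified
    at this level of detail in the abstract.
    """
    def __init__(self, latent_dim=64, text_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.time_emb = nn.Linear(1, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, text_emb, t):
        # Condition by concatenating text embeddings along the sequence axis;
        # only the speech positions are read out, so no per-phoneme alignment
        # is ever required -- total length is fixed at the utterance level.
        h = noisy_latents + self.time_emb(t[:, None, None].float())
        seq = torch.cat([self.text_proj(text_emb), h], dim=1)
        seq = self.backbone(seq)
        return self.out(seq[:, text_emb.size(1):])   # predicted noise, speech slots only

# One toy denoising training step (epsilon-prediction objective).
model = TinyDiTDenoiser()
latents = torch.randn(2, 100, 64)       # clean speech latents (B, T_speech, D)
text = torch.randn(2, 20, 64)           # text embeddings (B, T_text, D)
t = torch.rand(2)                       # diffusion timestep in [0, 1]
noise = torch.randn_like(latents)
alpha = (1 - t)[:, None, None]          # toy linear interpolation schedule
noisy = alpha * latents + (1 - alpha) * noise
loss = F.mse_loss(model(noisy, text, t), noise)
loss.backward()
```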