We introduce Seed-TTS, a series of large-scale autoregressive text-to-speech (TTS) models that can generate speech almost indistinguishable from human speech.
Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches real human speech in both objective and subjective evaluations.
Through fine-tuning, we achieve even higher subjective scores on these metrics. Seed-TTS exhibits excellent control over various speech attributes, such as emotion, enabling the generation of highly expressive and diverse speech for speakers in the wild.
Additionally, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance the model's robustness, speaker similarity, and controllability.
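The abstract leaves the RL objective unspecified; purely to illustrate the general idea, the sketch below applies a generic REINFORCE-style update to an autoregressive speech-token model. Here `model`, `reward_fn` (e.g. a speaker-similarity scorer), and all hyperparameters are hypothetical stand-ins, not the paper's actual components.

```python
import torch

def reinforce_step(model, text_tokens, reward_fn, optimizer, max_len=256):
    """One REINFORCE-style update for an autoregressive speech-token LM.

    `model` maps a token prefix (B, T) to next-token logits (B, T, V);
    `reward_fn` scores a full sampled sequence, e.g. with a speaker-similarity
    proxy. Both are hypothetical stand-ins for illustration only.
    """
    tokens = text_tokens                              # (B, T0) conditioning prefix
    log_probs = []
    for _ in range(max_len):
        logits = model(tokens)[:, -1, :]              # next-token logits
        dist = torch.distributions.Categorical(logits=logits)
        next_tok = dist.sample()                      # sample one speech token
        log_probs.append(dist.log_prob(next_tok))     # keep log-prob for the gradient
        tokens = torch.cat([tokens, next_tok.unsqueeze(1)], dim=1)

    reward = reward_fn(tokens)                        # (B,) scalar reward per sample
    baseline = reward.mean()                          # simple variance-reduction baseline
    # Policy-gradient loss: increase log-probs of above-average-reward samples.
    seq_log_prob = torch.stack(log_probs, dim=1).sum(dim=1)
    loss = -((reward - baseline).detach() * seq_log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()
```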
We also introduce a non-autoregressive (NAR) variant of the Seed-TTS model, called Seed-TTS DiT, which employs a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, Seed-TTS DiT does not depend on pre-estimated phoneme durations, instead generating speech through end-to-end processing.
We demonstrate that this variant achieves performance comparable to the language-model-based variant in both objective and subjective evaluations, and we showcase its effectiveness in speech editing.
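As a minimal sketch of what duration-free, end-to-end NAR generation can look like, the toy training step below denoises an entire speech-latent sequence conditioned on text embeddings; the sequence length is set once at the utterance level, so no per-phoneme durations appear anywhere. `TinyDiTDenoiser`, the epsilon-prediction objective, and the linear interpolation noise schedule are all assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiTDenoiser(nn.Module):
    """Toy transformer denoiser over a full speech-latent sequence.

    Illustrative only: the real Seed-TTS DiT architecture is not specified
    at this level of detail in the abstract.
    """
    def __init__(self, latent_dim=64, text_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.time_emb = nn.Linear(1, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, text_emb, t):
        # Condition by concatenating text embeddings along the sequence axis;
        # only the speech positions are read out, so no per-phoneme alignment
        # is ever required -- total length is fixed at the utterance level.
        h = noisy_latents + self.time_emb(t[:, None, None].float())
        seq = torch.cat([self.text_proj(text_emb), h], dim=1)
        seq = self.backbone(seq)
        return self.out(seq[:, text_emb.size(1):])   # predicted noise, speech slots only

# One toy denoising training step (epsilon-prediction objective).
model = TinyDiTDenoiser()
latents = torch.randn(2, 100, 64)       # clean speech latents (B, T_speech, D)
text = torch.randn(2, 20, 64)           # text embeddings (B, T_text, D)
t = torch.rand(2)                       # diffusion timestep in [0, 1]
noise = torch.randn_like(latents)
alpha = (1 - t)[:, None, None]          # toy linear interpolation schedule
noisy = alpha * latents + (1 - alpha) * noise
loss = F.mse_loss(model(noisy, text, t), noise)
loss.backward()
```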