I. The Opportunity for AI Voice Application Development
Technological Breakthroughs
The introduction of tools such as OpenAI's Whisper (a speech-to-text model) and ChatGPT Voice has made it possible to build high-quality voice applications. Voice applications built on the browser's Web Speech API were previously of limited quality, but far better options are now available.
Cost Reduction
OpenAI's Text-to-Speech (TTS) service has made voice feedback economically feasible. Compared with other commercial TTS models, OpenAI TTS offers a clear cost advantage, leaving a profit margin for commercial products such as educational voice applications.
II. Core Elements of Building AI Voice Applications (From a Beginner's Perspective)
Core Models
A voice application combines three models: speech-to-text (STT), a large language model (LLM) for intelligent processing, and text-to-speech (TTS). Together they form a core loop: receive audio, transcribe it to text, send the text to the LLM to generate a response, and pass the response to the TTS model to return audio.
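A minimal sketch of this loop, assuming the OpenAI Python SDK with an API key in the environment; the model names (whisper-1, gpt-4o-mini, tts-1) are illustrative choices, not the author's, and response helpers may differ slightly between SDK versions.

```python
# Minimal core loop: audio in -> text -> LLM -> text -> audio out.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def voice_turn(input_audio_path: str, output_audio_path: str) -> str:
    # 1. Speech-to-text: transcribe the user's recording.
    with open(input_audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. LLM: generate a reply to the transcribed text.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply = chat.choices[0].message.content

    # 3. Text-to-speech: synthesize the reply and save it as audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.write_to_file(output_audio_path)
    return reply

if __name__ == "__main__":
    print(voice_turn("user_question.wav", "assistant_reply.mp3"))
```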
Choice of Communication Pipeline
REST API: Suitable for a "push to talk" mode, similar to sending a voice message on WhatsApp. Latency requirements are modest and there is no packet loss to manage, but it is not suitable for voice applications that need real-time, conversational interaction. (A minimal push-to-talk sketch appears after this list.)
WebSockets API: Not recommended by the author. It can be used, but unlike WebRTC it leaves you to manage buffering yourself when network speed fluctuates, which complicates the code and the product and can lead to audio glitches.
WebRTC: The better choice. It handles fluctuating network conditions automatically, has a mature ecosystem of client libraries and numerous infrastructure providers, and makes it easier to adopt new AI models later, including a future move to voice-to-voice models.
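To make the REST "push to talk" pattern concrete, here is a sketch of a server endpoint, assuming FastAPI and a voice_turn helper like the one sketched earlier; the module name, route, and file handling are illustrative, not the author's implementation.

```python
# Push-to-talk over plain REST: the client uploads a finished recording and
# receives synthesized audio back in a single request/response cycle.
# Assumes `pip install fastapi uvicorn`; voice_turn is the hypothetical helper above.
import tempfile

from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse

from core_loop import voice_turn  # hypothetical module containing the earlier sketch

app = FastAPI()

@app.post("/talk")
async def talk(audio: UploadFile) -> FileResponse:
    # Persist the uploaded recording so the STT call can read it from disk.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_in:
        tmp_in.write(await audio.read())
    reply_path = tmp_in.name + ".reply.mp3"

    # Run the STT -> LLM -> TTS loop and return the synthesized reply.
    voice_turn(tmp_in.name, reply_path)
    return FileResponse(reply_path, media_type="audio/mpeg")
```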
Voice Activity Detection (VAD)
Detecting when the user starts and stops speaking is crucial. The author initially used a local, audio-level-based scoring system together with Apple's native TTS to keep the speaking pace coherent; this worked but had issues. Later, VAD was moved to the server, with FFmpeg used to preprocess the audio data, which added code complexity but produced better results.
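A rough sketch of server-side VAD along these lines, assuming ffmpeg is on the PATH and using the open-source Silero VAD model; the file names and sample rate handling are illustrative, and this is not necessarily how the author's pipeline is structured.

```python
# Server-side VAD sketch: normalize incoming audio with ffmpeg, then run
# Silero VAD to find where speech starts and stops.
# Assumes ffmpeg on PATH and `pip install torch torchaudio` for the Silero model.
import subprocess

import torch

def preprocess(src_path: str, dst_path: str) -> None:
    # Silero VAD expects 16 kHz mono audio, so resample and downmix with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ar", "16000", "-ac", "1", dst_path],
        check=True,
    )

def speech_segments(wav_path: str):
    # Load the pretrained Silero VAD model and its helper utilities.
    model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
    get_speech_timestamps, _, read_audio, _, _ = utils

    audio = read_audio(wav_path, sampling_rate=16000)
    # Returns a list of {"start": sample, "end": sample} dicts for detected speech.
    return get_speech_timestamps(audio, model, sampling_rate=16000)

if __name__ == "__main__":
    preprocess("incoming.webm", "incoming_16k.wav")  # illustrative file names
    print(speech_segments("incoming_16k.wav"))
```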
III. How to Build AI Voice Applications with Experience
Advantages of WebRTC and Related Providers
The advantages of WebRTC include adaptability to different network environments, a mature ecosystem, and easier future expansion and model migration. Related infrastructure providers include Daily.co and LiveKit, among others.
Model Selection
Choose models based on project goals; for example, prefer models that support streaming when speed matters. Providers have different strengths: OpenAI's STT and TTS perform well across the board, and Whisper is well suited to multilingual scenarios.
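As an illustration of why streaming support matters, here is a sketch of streaming LLM tokens so TTS can begin on the first sentence before the full reply exists, assuming the OpenAI Python SDK; the sentence-splitting logic is deliberately naive and the model name is an assumption.

```python
# Streaming the LLM response lets the TTS stage start speaking the first
# sentence while later sentences are still being generated, cutting latency.
# Assumes the OpenAI Python SDK; sentence splitting here is intentionally simple.
from openai import OpenAI

client = OpenAI()

def stream_sentences(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush each completed sentence so TTS can be called on it immediately.
        while True:
            ends = [buffer.find(p) for p in ".!?" if p in buffer]
            if not ends:
                break
            cut = min(ends) + 1
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    if buffer.strip():
        yield buffer.strip()

if __name__ == "__main__":
    for sentence in stream_sentences("Explain photosynthesis in two sentences."):
        print("-> TTS:", sentence)  # hand each sentence to the TTS model here
```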
Developer Frameworks
Pipecat plays a role similar to LangChain's in LLM work: it is loosely coupled with Daily.co and includes Silero VAD, interruption management, and support for multiple model providers. LiveKit Agents is another framework to consider.
Hosting Services
For developers who want to go from zero to a minimum viable product (MVP) quickly and acquire paying customers with fairly general needs, hosted services such as Vapi are an option: features can be implemented quickly and pricing is reasonable.