World’s Fastest Talking AI: Deepgram + Groq

In the world of artificial intelligence, the race is on to create the fastest and most efficient conversational AI. I recently had the opportunity to team up with a group of engineers from Deepgram to test the limits of their new text-to-speech model. Our goal was simple: see how low we could push end-to-end latency.

To build our conversational AI, we needed three main components: a speech-to-text model, a language model, and a text-to-speech model.

For the speech-to-text model, we used Deepgram's latest offering, Nova-2. The model is not only fast but also highly accurate, and Deepgram offers a range of other Nova models tailored to specific scenarios, such as drive-thru apps or phone calls. One standout feature of the Nova models is endpointing: the model recognizes a natural pause in the conversation and sets a flag to indicate that the speaker has finished talking, which is the signal our pipeline uses to hand the transcript off to the language model.
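To make the endpointing behavior concrete, here is a minimal sketch of streaming microphone audio to Deepgram's live transcription endpoint over a raw websocket. The query parameters, the `speech_final` flag, and the 300 ms endpointing window are assumptions based on Deepgram's streaming docs, and audio-encoding details are left out for brevity:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Query parameters follow Deepgram's live-transcription docs; the
# endpointing value (ms of trailing silence) is an assumed setting.
DG_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-2&interim_results=true&endpointing=300"
)

async def transcribe(audio_chunks) -> str:
    """Stream audio to Deepgram and return the utterance once the
    endpointer marks it finished (speech_final == True)."""
    headers = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    async with websockets.connect(DG_URL, extra_headers=headers) as ws:

        async def feed():
            async for chunk in audio_chunks:  # raw bytes from the microphone
                await ws.send(chunk)

        feeder = asyncio.create_task(feed())
        try:
            async for message in ws:
                result = json.loads(message)
                if result.get("type") != "Results":
                    continue
                transcript = result["channel"]["alternatives"][0]["transcript"]
                # speech_final flips to True when Deepgram's endpointing
                # decides the speaker has paused long enough to be done.
                if result.get("speech_final") and transcript:
                    return transcript
        finally:
            feeder.cancel()
    return ""
```

In a real app you would keep the socket open across turns; the sketch returns after the first finished utterance to keep the flow obvious.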

Next, we used the Groq API for our language model. Groq does not create models of its own; instead, it focuses on serving open-source models as fast as possible on custom chips called LPUs (Language Processing Units), which are designed specifically to accelerate inference.
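Groq exposes an OpenAI-compatible chat API, so wiring it in takes only a few lines. Here is a short sketch using Groq's Python SDK; the model id is an assumption, since Groq's lineup of hosted open-source models changes over time:

```python
import os

from groq import Groq  # pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def reply(transcript: str):
    """Yield the assistant's reply token by token."""
    stream = client.chat.completions.create(
        model="mixtral-8x7b-32768",  # assumed id; any Groq-hosted model works
        messages=[
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": transcript},
        ],
        stream=True,  # stream tokens so TTS can start before the reply is done
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

Streaming the tokens matters here: it lets the text-to-speech stage start speaking the first sentence while the rest of the reply is still being generated.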

Finally, we used Deepgram's Aura streaming model for the text-to-speech component. Deepgram has been in the transcription game for a long time and has access to a vast amount of audio data, which it has now used to train models that go from text to speech rather than just speech to text. Because Aura streams its output, the audio can be generated and played back in real time instead of being returned as a single finished file.
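Here is a minimal sketch of calling the Aura endpoint and consuming the audio as it streams back. The voice id (`aura-asteria-en`) and the exact endpoint shape are assumptions drawn from Deepgram's Aura documentation:

```python
import os

import requests  # pip install requests

def speak(text: str):
    """Yield audio bytes from Deepgram's text-to-speech endpoint as they arrive."""
    resp = requests.post(
        "https://api.deepgram.com/v1/speak?model=aura-asteria-en",
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "application/json",
        },
        json={"text": text},
        stream=True,  # start consuming audio before the full clip is rendered
    )
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1024):
        yield chunk  # feed these bytes straight to an audio player
```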

When we put all of these components together, we got a conversational AI that is both fast and highly accurate. Latency stayed remarkably low because every stage streams: the reply begins playing as soon as the first sentences come back from the language model, rather than after the full response has been rendered.
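A hypothetical glue loop, reusing the `transcribe`, `reply`, and `speak` sketches above, shows how that overlap works: each sentence is handed to Aura as soon as it is complete, so playback starts well before the language model finishes writing:

```python
import asyncio

def conversation_turn(audio_chunks, play_audio):
    # 1. Listen until Deepgram's endpointer says the user is done.
    transcript = asyncio.run(transcribe(audio_chunks))

    # 2. Stream the reply from Groq, buffering up to sentence boundaries.
    buffer = ""
    for token in reply(transcript):
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):  # crude sentence split
            # 3. Speak each finished sentence immediately.
            for audio in speak(buffer):
                play_audio(audio)
            buffer = ""

    # Flush whatever is left after the last sentence boundary.
    if buffer.strip():
        for audio in speak(buffer):
            play_audio(audio)
```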

In short, by combining the fastest speech-to-text, language, and text-to-speech models, we were able to create a conversational AI that is truly state-of-the-art. The potential applications for this technology are vast, and I am excited to see where it will go from here.