I recently had the opportunity to experiment with an exciting new API combination for voice chat inference. I combined the power of Groq and Deepgram to create what I believe is the fastest voice chat inference possible. In a previous video, I demonstrated a system that utilized OpenAI services for the same purpose. However, I wanted to see if I could improve the speed and efficiency of the system.
I started by using Whisper on the Groq API to transcribe audio to text. The Groq team was kind enough to give me early access to their API, and I was impressed by how quickly it was able to convert audio to text. It was almost three times faster than the OpenAI API.
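Here's a minimal sketch of that transcription step, assuming the Groq Python SDK (`pip install groq`), which exposes an OpenAI-style `audio.transcriptions` interface; the file name and model id below are illustrative, not taken from the video:

```python
import os

def transcribe(client, audio_path, model="whisper-large-v3"):
    """Upload an audio file and return the transcript text.

    `client` is any OpenAI-style client, e.g.
    `groq.Groq(api_key=os.environ["GROQ_API_KEY"])`.
    """
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(os.path.basename(audio_path), f.read()),
            model=model,
        )
    return result.text

# Usage (assumes a GROQ_API_KEY environment variable):
#   from groq import Groq
#   text = transcribe(Groq(), "question.wav")
```

Passing the client in as an argument also makes the function easy to swap between providers or test with a stub.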
Next, I used Llama 3 8B, also running on the Groq API, to generate a response to the user's text input. Finally, I used Deepgram's text-to-speech API to generate the final audio.
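The response-generation step looks much the same, again assuming the Groq SDK's OpenAI-compatible chat interface; the model id `llama3-8b-8192` and the system prompt here are my own choices, not something confirmed from the video:

```python
def generate_reply(client, user_text, model="llama3-8b-8192"):
    """Ask the LLM for a reply to the user's transcribed speech.

    `client` is any OpenAI-style chat client, e.g. `groq.Groq()`.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=[
            # Keep replies short: they go straight to text-to-speech.
            {"role": "system", "content": "You are a concise voice assistant."},
            {"role": "user", "content": user_text},
        ],
    )
    return resp.choices[0].message.content
```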
However, I ran into a timing problem with Deepgram: it couldn't generate audio fast enough to keep up with the pace of the rest of the pipeline, so I had to implement a few hacks to get it to work properly.
To give you an idea of just how fast Whisper on the Groq API is, I ran a speed test transcribing a 30-minute audio file from the Spring Update video. The OpenAI API took 67 seconds to complete the transcription, while the Groq API took only 24 seconds, roughly a 2.8x speedup.
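As a sanity check on those numbers: 30 minutes is 1800 seconds of audio, so both services run far faster than real time, and the gap between them is what matters for a voice chat loop:

```python
def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_seconds / wall_seconds

audio = 30 * 60                            # 1800 s from the Spring Update video
openai_rtf = realtime_factor(audio, 67)    # ~26.9x real time
groq_rtf = realtime_factor(audio, 24)      # 75.0x real time
speedup = 67 / 24                          # ~2.8x faster
```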
For the text-to-audio conversion, I used Deepgram, a service that provides both text-to-speech and speech-to-text. They offer $200 in credits when you sign up, so you can try their services for free.
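Here's a sketch of the text-to-speech call, assuming Deepgram's REST `/v1/speak` endpoint with one of their Aura voice models; the endpoint shape and the model name are my reading of their docs, so double-check them before relying on this:

```python
import json
import urllib.request

DEEPGRAM_SPEAK_URL = "https://api.deepgram.com/v1/speak"

def build_speak_request(text, api_key, model="aura-asteria-en"):
    """Assemble the URL, headers, and JSON body for a TTS request."""
    url = f"{DEEPGRAM_SPEAK_URL}?model={model}"
    headers = {
        "Authorization": f"Token {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"text": text}).encode("utf-8")
    return url, headers, body

def speak_to_file(text, out_path, api_key):
    """Send the request and write the returned audio bytes to disk."""
    url, headers, body = build_speak_request(text, api_key)
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Keeping the request assembly separate from the network call makes the interesting part checkable without an API key.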
I'm still working on perfecting the system, but I wanted to share my progress with you. I'll be making the code available on GitHub once I've cleaned it up and put everything together. It will be open source and available for free.
One thing to keep in mind if you're using Whisper on Groq to transcribe your audio is that there is a rate limit. However, since it's currently free, it's hard to complain about that. Deepgram is a paid service, but the $200 in sign-up credits goes a long way.
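One generic way to live with a rate limit is to wrap each API call in an exponential-backoff retry. This helper is my own addition, not part of either SDK:

```python
import time

def with_backoff(fn, retry_on=(Exception,), max_retries=5,
                 base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays.

    `retry_on` should be the SDK's rate-limit exception type(s),
    e.g. a 429 error class. `sleep` is injectable for testing.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Usage:
#   text = with_backoff(lambda: transcribe(client, "question.wav"))
```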
Overall, I'm excited about the potential of this new API combination for voice chat inference. I'm looking forward to continuing to experiment with it and seeing what other possibilities it opens up.