Jean-Louis Queguiner, the founder of AI deployment company Gladia, announced the release of Audio Transcription Alpha. Built on OpenAI's Whisper-Large-v2, the speech-to-text API can transcribe a one-hour file in 10 seconds with a Word Error Rate as low as 1%, which the company claims makes it at least five times more accurate than comparable products on the market. Gladia believes this opens up significant opportunities in the audio intelligence space and will broaden future AI applications through plug-and-play APIs.
Whisper is a pre-trained model for Automatic Speech Recognition (ASR), proposed by Alec Radford and colleagues at OpenAI and trained on 680,000 hours of audio data. The large-v2 variant was trained for 2.5 times more epochs for improved performance. Whisper generates human-readable transcriptions, meaning the ASR system can output commas, periods, hyphens and other punctuation marks, producing high-quality transcriptions with a low Word Error Rate (WER).
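For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the system's output, divided by the reference length. A minimal illustrative sketch (not Gladia's or OpenAI's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 1 error / 4 words = 0.25
```

A WER of 1% therefore means roughly one wrong, missing, or extra word per hundred words of reference transcript.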
Integrating the latest NLP and deep learning research, the Alpha API is built on neural network optimization, which the company says makes inference around 60 times faster than similar providers on the market. Gladia is currently working on 250 models to create a "holistic intelligence solution" capable of more than 45 tasks, including translation, summarization, gender detection and sentiment analysis.
Inference speed is another parameter under consideration. The baseline was established by comparing against the inference speed of other speech-to-text (STT) providers. At a 16 kHz sampling rate and 16-bit encoding, Alpha transcribed one hour of audio in both mono and stereo configurations, and the results were compared against other models performing the same task under the same parameters.
The company also believes that "democratizing access" to AI should not only be about cost; it should also be about simplifying the complexity of the tools used.