OpenAI's Whisper: Reading the Fine Print

TL;DR:

  • Deepgram offers a fully managed Whisper API that’s faster, more reliable, and cheaper than OpenAI's. 

  • But even with these improvements, Deepgram Nova is faster, cheaper, and more accurate than any Whisper model in the market (including our own).

  • Beyond performance and cost, Whisper models lack critical features and functionality that can impede successful productization.

  • Deepgram Smart Formatting is now available and delivers entity formatting results superior to OpenAI Whisper's.

In September 2022, OpenAI released Whisper, its general-purpose, open source model for automatic speech recognition (ASR) and translation tasks. OpenAI researchers developed Whisper to study speech processing systems trained under large-scale weak supervision and, in OpenAI’s own words, for “AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model.” In contrast, Deepgram has developed a Language AI platform that includes speech-to-text APIs that enable software developers to quickly build scalable, production-quality products with voice data.

OpenAI offers Whisper in five model sizes, ranging from 39 million to over 1.5 billion parameters. Larger models tend to provide higher accuracy at the cost of increased processing time and compute; increasing the processing speed of the larger models requires additional computing resources.
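For a hands-on feel for this tradeoff, the sketch below loads each published model size with the open source whisper package and times a transcription. The audio file name is a placeholder, and actual runtimes depend entirely on your hardware.

```python
# Illustrative sketch using the open source `whisper` package (pip install openai-whisper).
# "sample.wav" is a placeholder; timings vary widely with hardware and audio length.
import time
import whisper

for size in ["tiny", "base", "small", "medium", "large"]:
    model = whisper.load_model(size)          # downloads weights on first use
    start = time.time()
    result = model.transcribe("sample.wav")   # path to a local audio file
    print(f"{size:>6}: {time.time() - start:.1f}s -> {result['text'][:60]}...")
```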

Whisper is an open source software package and can be a great choice for hobbyists, researchers, and developers interested in creating product demos and prototypes, or conducting technical research on AI speech recognition and translation. However, when it comes to building production systems at scale involving real-time processing of streaming voice data, there are a number of considerations that may make Whisper less suitable than commercially available ASR solutions. Some of its notable limitations include:

  • Whisper is slow and expensive

  • Only the Large-v2 model is available via OpenAI's API (Tiny, Base, Small, and Medium are excluded)

  • No built-in diarization, word-level timestamps, or keyword detection

  • A 25 MB file size cap and a low limit on requests per minute

  • No support for transcription via hosted URLs or callbacks

  • No real-time transcription out of the box; Whisper supports only batch processing of pre-recorded audio, with no streaming support

  • No model customization or ability to train Whisper on your own data to improve performance

  • Limited entity formatting

OpenAI Whisper’s Performance and Cost

In April, we announced Nova—the fastest, cheapest, and most accurate speech recognition model in the market today. In addition, we also released Deepgram Whisper Cloud, a fully managed Whisper API that supports all five open source models, and is faster, more reliable, and cheaper than OpenAI's.

Since launch, we’ve helped customers transcribe over 400 million seconds of audio with our new Whisper Cloud. We believe strongly that if you’re going to use Whisper, you should use our managed Whisper offering: it’s 20% more affordable (for Whisper Large), 3 times faster, and provides more accurate transcription results than what you can currently get with OpenAI’s model.

But implementing Whisper, even through a managed service offering like Deepgram’s, is not without its shortcomings. We conducted rigorous testing of Nova against its competitors on 60+ hours of human-annotated audio pulled from real-life situations, encompassing diverse audio lengths, speakers, accents, environments, and domains, to ensure a practical evaluation of its real-world performance.

Using these datasets, we calculated the Word Error Rate (WER)[1] of Nova and Deepgram’s Whisper models and compared it to OpenAI’s most accurate model (Whisper Large). The results show that Deepgram’s Whisper Large model beats OpenAI’s in each domain, while Nova leads the pack by a considerable margin. Nova achieves a median WER of 7.4% across the files tested, a 45.2% relative improvement over OpenAI Whisper Large, which had a median WER of 13.5% (see Figure 1). If you’re going to use Whisper, use Deepgram Whisper Cloud. But if you need the most accurate model, use Deepgram Nova.

Figure 1: The figure above compares the Word Error Rate (WER) distributions of our Nova and Whisper models with OpenAI’s Whisper Large model across three audio domains: video/media/podcast, meeting, and phone call. It uses a boxplot chart, a chart type often used to visualize the distribution and skewness of numerical data. Each box displays the five-number summary of its dataset: the minimum value, first quartile (median of the lower half), median, third quartile (median of the upper half), and maximum value.
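To make the WER metric in Figure 1 concrete, here is a minimal, illustrative computation: word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. This is a simplified sketch, not the exact evaluation pipeline or text normalization used in our benchmarks.

```python
# Minimal WER sketch: word-level edit distance over reference vs. hypothesis,
# normalized by the reference length. Real benchmarks also normalize text first.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("call me at two thirty", "call me at 2 30"))  # 0.4
```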

The time it takes to generate a transcript can make or break your use case. Many applications require real-time performance, like a conversational AI voicebot or IVR system where an end user expects to hear responses in milliseconds. Transcription solutions with high latency simply won’t get the job done.

Deepgram offers batch processing of pre-recorded audio as well as real-time processing of streaming audio. OpenAI Whisper only offers batch processing of pre-recorded audio. If your use case relies on real-time speech processing, Whisper will not meet your needs unless you devote significant in-house engineering effort to serving the model in real time.
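For comparison, here is a rough sketch of what real-time transcription looks like against Deepgram's live WebSocket endpoint. The endpoint URL, query parameters, and response shape shown are assumptions based on Deepgram's public documentation, and the file name and API key are placeholders; treat this as an illustration rather than a drop-in integration.

```python
# A minimal live-streaming sketch. Assumptions: the wss://api.deepgram.com/v1/listen
# endpoint, Token-based auth, and the JSON response shape; check the API docs for
# current details. Requires the `websockets` package (header kwarg name varies by version).
import asyncio
import json
import websockets

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder
URL = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000"

async def stream(path: str) -> None:
    headers = {"Authorization": f"Token {DEEPGRAM_API_KEY}"}
    async with websockets.connect(URL, extra_headers=headers) as ws:

        async def sender() -> None:
            # Send raw 16 kHz, 16-bit PCM in small chunks, roughly paced in real time.
            with open(path, "rb") as audio:
                while chunk := audio.read(8000):
                    await ws.send(chunk)
                    await asyncio.sleep(0.25)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver() -> None:
            # Print each transcript as it arrives.
            async for message in ws:
                result = json.loads(message)
                alternatives = result.get("channel", {}).get("alternatives", [])
                if alternatives and alternatives[0].get("transcript"):
                    print(alternatives[0]["transcript"])

        await asyncio.gather(sender(), receiver())

asyncio.run(stream("audio.raw"))
```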

In tests measuring the inference time of pre-recorded audio, Nova was 13 times faster than OpenAI Whisper and even 4 times faster than our highly optimized Deepgram Whisper Large model.

Figure 2: The median inference time per audio hour across Whisper model sizes.

OpenAI Whisper is available through OpenAI’s platform via an API providing on-demand access priced at $0.006 per minute. Our Deepgram Whisper "Large" model (OpenAI's large-v2) starts at only $0.0048/minute, making it ~20% more affordable than OpenAI's offering. But even better, our fastest and most accurate model, Nova, starts at just $0.0043 per minute, nearly 30% more affordable than OpenAI Whisper.
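As a back-of-the-envelope illustration of what those per-minute rates mean at scale, the snippet below prices an assumed workload of 10,000 audio hours per month (the volume is purely illustrative):

```python
# Monthly cost at the per-minute rates quoted above, for an illustrative
# workload of 10,000 audio hours per month.
PRICES_PER_MINUTE = {
    "OpenAI Whisper API": 0.0060,
    "Deepgram Whisper Large": 0.0048,
    "Deepgram Nova": 0.0043,
}

hours = 10_000
minutes = hours * 60
for name, rate in PRICES_PER_MINUTE.items():
    print(f"{name:<24} ${minutes * rate:,.2f} / month")
# OpenAI Whisper API:     $3,600.00 / month
# Deepgram Whisper Large: $2,880.00 / month
# Deepgram Nova:          $2,580.00 / month
```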

OpenAI Whisper’s Limited Features and Support

For an open source project, Whisper’s models are certainly better than other, more spartan open source models that came before it. Whisper goes beyond mere speech-to-text transcription, which is unique in the open source landscape, but it does not offer any features that are not already available in existing commercial solutions. Again, accuracy varies as a function of the model’s size, but in general, Whisper offers:

  • Punctuation

  • Numeral formatting

  • Profanity filtering (English only)

  • Voice activity detection

  • Language detection

  • Language translation

Other Factors to Consider:

  • As with any open source project, it is up to the user to host, develop, and maintain any solution that incorporates Whisper.

  • Given its limited feature set, users must be prepared to devote engineering and research resources to build and maintain additional functionality.

  • Whisper has a number of known failure modes that developers need to handle (e.g. hallucinations, issues with silent segments, repetition in the output, etc.). These errors can be catastrophic in certain ASR use cases, such as compliance, finance, healthcare, and legal services.

OpenAI does not offer ongoing support, integration assistance, or other help with getting the most out of Whisper. There are some user forums for Whisper, including the Community page on OpenAI’s website and the discussion section of Whisper’s GitHub page, but otherwise, users are mostly on their own if they run into issues or have questions. This alone can be a dealbreaker for using Whisper in production environments at scale.

Deepgram Smart Formatting

At Deepgram, we know that good formatting is key to providing our customers with the results they need. Audio data is often full of entities like phone numbers, dates, email addresses, and currency values. The best transcription solutions don’t just transcribe those accurately; they transcribe them in a way that’s easy to read, too. ASR users may want to search podcast transcripts for mentions of specific websites to determine if sponsorship agreements are being honored, perform downstream redaction on certain types of numbers for compliance reasons, or simply be able to read their transcripts at a glance without struggling over a lack of formatting.

Whisper has set a respectable bar for formatted output, with numbers, dates, addresses, and several other entities properly formatted as customers would expect. Formatting contributes greatly to the perception of accuracy, and even in cases where the WER is worse than alternative ASR solutions or the model hallucinates, the visual appeal of properly formatted entities can often be a powerful counterpunch. However, while Whisper is capable of formatting some basic numbers and alphanumerics, its formatting and punctuation are often inconsistent and don’t cover the range of real-world use cases.

Deepgram is pleased to announce the release of major improvements to its Smart Formatting feature, capable of handling a wide array of entities with incredible accuracy and consistency.

“Deepgram’s speech recognition services are a crucial component of the Conversation AI Cloud that Outbound AI provides to the healthcare market. While many solutions ignore operational considerations such as entity extraction and recognition consistency, Deepgram puts this front and center in their offering, which is critical to the performance of our AI-powered virtual agents. The same holds true for Deepgram’s exceptional accuracy and extreme low latency. Having benchmarked the top 14 providers in this space, Deepgram was the clear and pragmatic choice for us.”

– Jonathan Wiggs, Co-Founder & CTO of Outbound AI

To use the updated version of this feature, you’ll need to be using our Nova general model. It’s included for free for both hosted and on-prem customers: simply set tier=nova&smart_format=true to get best-in-class formatting, as in the sketch below.
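Here is a minimal pre-recorded request with Smart Formatting enabled. The tier=nova and smart_format=true parameters come from the paragraph above; the endpoint URL, auth header, and response shape are assumptions based on Deepgram's public documentation, and the file name and API key are placeholders.

```python
# Minimal pre-recorded transcription request with Smart Formatting enabled.
# Assumptions: the https://api.deepgram.com/v1/listen endpoint, Token auth,
# and the results.channels[...].alternatives[...] response shape.
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder

with open("invoice_call.wav", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"tier": "nova", "smart_format": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=audio,
    )

response.raise_for_status()
result = response.json()
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```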

Smart Formatting provides support for the following entity types:

  • Dates

  • Times

  • Ordinal numbers

  • Cardinal numbers

  • Currency amounts

  • Account numbers

  • Tracking numbers

  • Phone numbers

  • Addresses

  • Percents

  • Emails

  • URLs

With this release, Nova customers will enjoy more accurate transcripts that are properly formatted for easy consumption and follow-up analysis, improving compliance, readability, and utility for downstream natural language understanding tasks across the range of intelligent voice applications our customers are creating. Whether you're using our product for customer service or data analysis, this feature can save you time and help you make better-informed decisions.

Our team has worked tirelessly to ensure that this update meets our high standards of accuracy and reliability. We're confident that you'll see a significant improvement in the formatting of your transcripts, and we can't wait for you to try it out for yourself.

To learn more, please visit our API Documentation, or try out our models and features right away in our API Playground.


Footnotes:

  1. For OpenAI, the accuracy/WER analysis was performed on a distribution of files of shorter duration due to the file size limitation of the OpenAI API.

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
