OpenAI

Whisper

Open-source speech recognition that transcribes audio across many languages.

Free · Audio & Voice · Speech-to-text · Open source

Whisper is OpenAI's speech recognition system, an open-source model that turns spoken audio into accurate text and can translate speech from dozens of languages into English. Released in 2022 and freely available to download and run, it became one of the most influential AI tools of its era precisely because it was open: anyone could use it, build on it, and embed it into their own products without permission or per-use fees, as long as they ran it themselves.

What made Whisper a breakthrough was robustness. It was trained on an enormous, messy mix of real-world audio (accented speech, background noise, technical jargon, many languages), which lets it hold up where older transcription systems fell apart. It transcribes noisy, accented, and specialized audio reliably without any task-specific tuning, and it does so across 99 languages, automatically detecting which one it is hearing.

This guide covers everything that matters about Whisper in 2026: what it does, the languages and tasks it handles, how to use it whether you self-host the open model or call OpenAI's API, what it costs, how it compares to other transcription services, and the limitations to keep in mind. By the end you will know exactly when Whisper is the right tool.

Figure: Whisper transcription in action, with an audio waveform on the left and the model's timestamped text output on the right, and the detected language labeled.

What Is Whisper?

Whisper is an automatic speech recognition (ASR) model built by OpenAI. You give it an audio file or a stream of speech and it returns the words as text. Beyond plain transcription, it can translate non-English speech directly into English text, identify the spoken language automatically, and distinguish speech from silence. It was trained on roughly 680,000 hours of multilingual audio gathered from the web, which is the source of its unusual resilience to real-world conditions.

Crucially, Whisper is open-source and free to self-host. OpenAI published the model weights, so developers and researchers can run it on their own hardware at no cost, fine-tune it, and integrate it into anything from a podcast tool to an accessibility app. That openness is the reason Whisper sits at the heart of countless transcription products today, and many tools you use are quietly powered by it.

For those who do not want to run it themselves, OpenAI also offers Whisper as a hosted API, billed per minute of audio, so you can get transcriptions with a simple API call and no infrastructure to manage. The two paths, self-hosting the open model or calling the managed API, cover both ends of the spectrum from full control to maximum convenience.

What Whisper Can Do

Whisper handles a focused but powerful set of speech tasks.

Task | What it does
Transcription | Convert spoken audio into written text in the original language, with timestamps available.
Translation | Translate non-English speech directly into English text in a single step.
Language identification | Automatically detect which of 99 languages is being spoken, from the first 30 seconds of audio.
Voice activity detection | Distinguish speech from silence and background noise to segment audio cleanly.

Language Support

Whisper supports 99 languages, transcribing audio in its source language or translating it into English. If you do not tell it the language, it detects it automatically from the first 30 seconds of audio. If you already know the language, specifying it skips the detection step and can improve accuracy. This breadth makes Whisper a genuinely global tool, generally handling multilingual content, accented speech, and code-switching better than systems built around a handful of major languages.
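As a rough illustration of the detection step, the sketch below follows the usage pattern documented for the open-source `openai-whisper` Python package. It assumes that package and ffmpeg are installed; the `base` model size and the helper name `most_likely_language` are illustrative choices, not part of Whisper itself.

```python
def most_likely_language(probs):
    """Return (language_code, probability) for the highest-scoring language."""
    code = max(probs, key=probs.get)
    return code, probs[code]


def detect_language(path, model_name="base"):
    """Detect the spoken language from the first 30 seconds of an audio file.

    Assumes the open-source `openai-whisper` package and ffmpeg are installed.
    """
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    audio = whisper.load_audio(path)
    audio = whisper.pad_or_trim(audio)  # pad or cut to a single 30-second window
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)  # dict: language code -> probability
    return most_likely_language(probs)


# Example call (the path is a placeholder):
#   code, confidence = detect_language("interview.mp3")
```

A low top probability is a reasonable cue to ask the user for the language instead of trusting the guess.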

How to Use Whisper

There are two main ways to put Whisper to work, depending on how much control and infrastructure you want.

1. Self-Host the Open Model

Because the model is open-source, you can download it and run it on your own machine or server for free. This gives you full control and privacy, since your audio never leaves your hardware, and there are no per-minute charges no matter how much you process. The trade-off is that you provide the compute; the larger, most accurate versions of the model benefit from a capable GPU. This path suits developers, researchers, and anyone with sensitive audio or high volumes.
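A minimal self-hosting sketch might look like the following. It assumes the open-source `openai-whisper` package (plus ffmpeg) is installed; the wrapper name `transcribe_local` and its parameters are hypothetical conveniences, while `load_model` and `transcribe` with the `task` and `language` options come from that package.

```python
def transcribe_local(path, model=None, model_name="base", translate=False, language=None):
    """Transcribe (or translate into English) an audio file with a local model.

    `model` may be any object exposing a Whisper-style `transcribe` method;
    if omitted, the open-source `openai-whisper` package is loaded.
    """
    if model is None:
        import whisper  # requires `pip install openai-whisper` plus ffmpeg

        model = whisper.load_model(model_name)

    result = model.transcribe(
        path,
        task="translate" if translate else "transcribe",
        language=language,  # None lets Whisper auto-detect the language
    )
    return result["text"].strip()


# Example calls (paths are placeholders):
#   transcribe_local("lecture.mp3")                      # text in the source language
#   transcribe_local("entrevista.mp3", translate=True)   # English translation
```

Smaller model names trade accuracy for speed, so it is worth trying more than one size against a sample of your own audio before committing to hardware.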

2. Call OpenAI's Hosted API

If you would rather not manage infrastructure, OpenAI's API exposes Whisper as a simple endpoint billed per minute of audio. You send a file, you get back text, with no servers, no GPUs, and no setup. This is the fastest route for building transcription into an app, and it is what many smaller projects use. Developers already building with the ChatGPT ecosystem will find it a natural fit.
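For the hosted route, a call through OpenAI's official Python SDK could be sketched as below. It assumes the `openai` package is installed and an `OPENAI_API_KEY` environment variable is set; the wrapper name `transcribe_via_api` and the optional `client` argument (handy for testing) are illustrative, not part of the SDK.

```python
def transcribe_via_api(path, client=None, model="whisper-1"):
    """Send an audio file to OpenAI's hosted transcription endpoint.

    `client` is optional so a stub can be injected in tests; by default a real
    OpenAI client is created, which reads OPENAI_API_KEY from the environment.
    """
    if client is None:
        from openai import OpenAI  # requires `pip install openai`

        client = OpenAI()

    with open(path, "rb") as audio_file:
        response = client.audio.transcriptions.create(model=model, file=audio_file)
    return response.text


# Example call (the path is a placeholder):
#   print(transcribe_via_api("meeting.m4a"))
```

Because billing is per minute of audio, trimming long silences before upload is one simple way to keep costs down.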

Figure: The two paths to using Whisper, self-hosting the open model locally for full control or calling OpenAI's hosted API for zero-setup transcription.

Pricing

Whisper's pricing is unusually simple, and one of its biggest draws. Figures below are standard published rates; always confirm current pricing on the official site.

Option | Approximate price | Notes
Self-hosted | Free | Run the open-source model on your own hardware. No per-use fees; you provide the compute.
Hosted API (Whisper) | ~$0.006 / minute | OpenAI's managed endpoint, billed per minute of audio with no setup required.
Newer transcribe models | from ~$0.003 / minute | OpenAI also offers newer, cheaper transcription models and a pricier real-time option for live audio.

At roughly $0.006 per minute, an hour of audio costs about $0.36 through the hosted API. There are generally no volume discounts on the standard Whisper API, so the rate is the same whether you process one hour or thousands. For very high volumes, self-hosting the free open model often becomes the cheaper path, even after paying for the compute you supply.
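The arithmetic behind that comparison can be sketched as follows. The API rate is the approximate figure quoted above; the GPU price and processing speed passed to `self_host_cost_usd` are purely hypothetical inputs that you would replace with your own measurements.

```python
WHISPER_API_RATE_USD_PER_MIN = 0.006  # approximate published rate quoted above


def api_cost_usd(audio_minutes, rate_per_minute=WHISPER_API_RATE_USD_PER_MIN):
    """Cost of sending `audio_minutes` of audio to the hosted API."""
    return audio_minutes * rate_per_minute


def self_host_cost_usd(audio_minutes, gpu_usd_per_hour, speed_factor):
    """Compute cost of self-hosting.

    speed_factor: minutes of audio processed per minute of GPU time
    (an assumption; measure it on your own hardware and model size).
    """
    gpu_hours = audio_minutes / speed_factor / 60
    return gpu_hours * gpu_usd_per_hour


one_hour = api_cost_usd(60)                 # about 0.36 USD
bulk_api = api_cost_usd(1000 * 60)          # 1,000 hours of audio: about 360 USD
# Hypothetical rented GPU at $1.00/hour transcribing 10x faster than real time:
bulk_self = self_host_cost_usd(1000 * 60, gpu_usd_per_hour=1.00, speed_factor=10)
print(f"1 h via API: ${one_hour:.2f}; 1,000 h via API: ${bulk_api:.2f}; "
      f"1,000 h self-hosted (hypothetical): ${bulk_self:.2f}")
```

The self-hosted figure covers GPU time only; engineering and maintenance effort are real costs that this sketch leaves out.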

How Whisper Compares

Whisper competes with cloud speech-to-text services from the major providers. Its distinguishing traits are accuracy on messy audio, language breadth, and the fact that it is open.

Feature | Whisper | Cloud STT services
Open source | Yes, free to self-host and fine-tune | No, proprietary and API-only
Accuracy on noisy audio | Strong, trained on messy real-world data | Varies by provider
Languages | 99, with auto-detection | Many, varies by provider
Cost model | Free self-hosted or ~$0.006/min API | Per-minute API pricing

On independent meeting benchmarks Whisper achieves a word error rate in the 5 to 6% range on English audio, holding its own against and often beating the big cloud transcription services. The clincher for many teams, though, is openness: you can run it privately, for free, at any scale, which no proprietary API allows.
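Word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words, so a 5% WER means about five errors per hundred words spoken. A minimal way to compute it on your own audio, assuming simple lowercasing and whitespace tokenization (published benchmarks usually apply more careful text normalization), is:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words:
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Scoring a few representative recordings this way is often more informative than a headline benchmark number, since WER depends heavily on audio quality and vocabulary.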

Real-World Use Cases

Transcription and Captioning

The most common use is turning recorded audio into text (podcasts, interviews, lectures, and meetings) and generating captions and subtitles for video. Whisper's timestamps make it easy to align text with the source audio.
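To show how timestamps turn into captions, the sketch below converts Whisper-style segments into SubRip (SRT) text. It assumes each segment is a dictionary with `start` and `end` times in seconds and a `text` string, which is the shape the open-source package returns under `result["segments"]`; the function names are illustrative.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT caption text from a list of {'start', 'end', 'text'} segments."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive captions


example = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the show."},
    {"start": 2.5, "end": 4.0, "text": " Let's get started."},
]
print(segments_to_srt(example))
```

Most video players and editors accept the resulting text when it is saved with an `.srt` extension.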

Accessibility

Whisper powers accessibility features like live captions and transcripts that make spoken content usable for people who are deaf or hard of hearing, and searchable for everyone.

Building Voice Features

Developers embed Whisper as the speech-to-text layer in voice assistants, dictation tools, call analytics, and any app that needs to understand spoken input, often pairing it with a language model to act on what was said.

Limitations to Keep in Mind

Limitation | What to know
Compute for self-hosting | Running the larger, most accurate models locally needs a capable GPU; lighter models trade some accuracy for speed.
Not real-time out of the box | The standard model transcribes recorded files; low-latency live transcription needs the real-time variant or extra engineering.
Can hallucinate text | On silence, noise, or unclear audio Whisper can occasionally invent words that were not spoken. Review output for critical use.
No built-in speaker labels | Whisper transcribes what is said but does not natively identify who said it; speaker diarization needs additional tooling.
Accuracy varies by language | It is strongest on high-resource languages like English; accuracy can drop on less common languages and heavy accents.
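One practical way to act on the hallucination caveat is to flag low-confidence segments for human review. The sketch below assumes segments shaped like those returned by the open-source `openai-whisper` package, which include `no_speech_prob`, `avg_logprob`, and `compression_ratio` fields. The thresholds mirror that package's default decoding settings and are only starting points, not guarantees; the function name is hypothetical.

```python
def flag_suspect_segments(
    segments,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    compression_ratio_threshold=2.4,
):
    """Return segments that deserve manual review, with the reasons why."""
    suspects = []
    for seg in segments:
        reasons = []
        if seg["no_speech_prob"] > no_speech_threshold:
            reasons.append("likely silence or noise")
        if seg["avg_logprob"] < logprob_threshold:
            reasons.append("low model confidence")
        if seg["compression_ratio"] > compression_ratio_threshold:
            reasons.append("repetitive text")
        if reasons:
            suspects.append((seg["start"], seg["end"], seg["text"].strip(), reasons))
    return suspects


# Made-up segments for illustration only:
sample = [
    {"start": 0.0, "end": 3.0, "text": " Good morning, everyone.",
     "no_speech_prob": 0.02, "avg_logprob": -0.21, "compression_ratio": 1.1},
    {"start": 3.0, "end": 9.0, "text": " Thank you. Thank you. Thank you.",
     "no_speech_prob": 0.81, "avg_logprob": -1.35, "compression_ratio": 2.9},
]
for start, end, text, reasons in flag_suspect_segments(sample):
    print(f"[{start:.1f}-{end:.1f}] {text!r}: {', '.join(reasons)}")
```

Flagging rather than deleting keeps a human in the loop, which matters for the critical uses mentioned in the table.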

Final Verdict

Whisper is the quiet backbone of modern transcription. Its accuracy on real-world, noisy, multilingual audio, support for 99 languages, and above all its open, free-to-self-host nature have made it the default speech-to-text engine for an enormous range of tools and projects. Whether you run it yourself for free or call OpenAI's cheap hosted API, it delivers reliable transcription with minimal fuss.

It is not a turnkey app with a polished interface, and it leaves real-time transcription and speaker labeling to additional tooling. As a speech recognition engine, though, Whisper is hard to beat on quality and value. If you are building a voice feature, pair it with a language model such as ChatGPT to act on the transcripts.

Frequently asked questions

Is Whisper free?

Yes. Whisper is open-source and free to download and self-host on your own hardware, with no per-use fees. If you prefer not to manage infrastructure, OpenAI's hosted API offers it at roughly $0.006 per minute of audio.

How many languages does Whisper support?

Whisper supports 99 languages. It transcribes audio in the original language or translates non-English speech into English, and it automatically detects the spoken language from the first 30 seconds of audio if you do not specify it.

Can I run Whisper on my own computer?

Yes. Because the model is open-source, you can run it locally for free and keep your audio private. The larger, most accurate versions benefit from a capable GPU, while smaller models run on more modest hardware with some loss of accuracy.

How accurate is Whisper?

Very accurate. It achieves roughly a 5 to 6% word error rate on English audio and performs strongly on noisy, accented, and technical speech, often matching or beating major cloud transcription services on benchmarks. Accuracy is highest for high-resource languages like English.

Does Whisper work in real time?

The standard model is built for transcribing recorded audio files rather than live streams. Low-latency, real-time transcription requires OpenAI's real-time transcription option or additional engineering around the open model.

Who made Whisper?

Whisper was built by OpenAI, the company behind ChatGPT. It was released as an open-source automatic speech recognition system trained on about 680,000 hours of multilingual audio.
