OpenAI has released three audio models in its API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 is the first voice model built on GPT-5-class reasoning, meaning it can handle more complex requests and sustain coherent conversations in real time. The other two handle, respectively, live translation from 70-plus input languages into 13 output languages and streaming speech-to-text transcription.
The significance here is architectural. These are not post-processed audio tools. All three models operate in real time, which targets the latency bottleneck that has made voice AI feel clunky in production environments. GPT-5-class reasoning inside a voice model is the detail worth sitting with: that capability tier is now accessible through the API.
The full announcement is worth reading for the API specifics, model naming conventions, and the developer use cases OpenAI is targeting first. The translation model's asymmetry, 70-plus input languages but only 13 output languages, will tell you a lot about where the gaps still are.