Weekly Updates - May 11th 2026
Weekly Voice and Video AI Product and Platform news
🗞️ Market and Product News
ElevenLabs surpasses $500M in ARR. Its Series D expands past $550M as BlackRock, NVIDIA, Jamie Foxx, and 30+ other investors join. ARR crossed $500M in Q1 2026, up from $350M at year-end 2025 — $100M of net-new ARR in a single quarter. Valuation now stands at $11B.
Retell AI passes $60M in ARR. The company hit $60M ARR with a 35-person team, processing more calls per second than the U.S. 911 system — one of the fastest trajectories in voice AI infrastructure.
Ethos raises $33M for an expert network built on voice AI interviews. a16z leads a $22.75M Series A for the company, which onboards 35,000 people per week through voice AI interviews. Ethos is on track for eight-figure ARR, charging 30%+ per project to hedge funds, PE firms, and AI labs.
Greenhouse acquires Ezra AI Labs to embed voice AI interviewing into its ATS. The trigger: applications per recruiter on Greenhouse have spiked 412% since 2023. Ezra generates structured, role-specific interview scores and transcripts with full explainability.
Mahindra deploys voice agents with ElevenLabs to scale outreach for SUV launch. Mahindra deployed ElevenLabs voice agents for the XUV 7XO launch to manage peak demand — achieving higher contact rates and ~8% conversion uplift during the campaign.
Wispr Flow bets on India as its fastest-growing market. The voice dictation app adds Hinglish support and hits 2.5M global downloads, with India growing 100% month-over-month. India accounts for 14% of installs but only 2% of revenue — the monetization gap is the real story.
🧰 Platform News
OpenAI releases three voice models. GPT-Realtime-2 with GPT-5-class reasoning and a 128K context window, GPT-Realtime-Translate for live speech translation from 70+ input languages into 13 output languages, and GPT-Realtime-Whisper for streaming STT. The Realtime API exits beta and is now generally available.
Inworld releases Realtime TTS-2. A closed-loop voice model that takes prior audio turns as input — not just transcripts — to read the user’s actual tone and pacing, then adapts delivery mid-conversation. Voice Direction lets developers steer output in plain English. Sub-200ms first-chunk latency, 100+ languages, ranked #1 on Artificial Analysis Speech Arena.
Pocket TTS now supports six languages. Kyutai’s 100M-parameter open-source TTS model goes multilingual: French, Spanish, Portuguese, Italian, German, plus an improved English model — all running in real time on CPU without a GPU, shipped in a single open-source release.
Krisp introduces VIVA 2.0. Turn Prediction v3 detects 47% more true turn-shifts within 200ms vs. v2 across 12+ languages. A new Interrupt Prediction model distinguishes real interruptions from backchannel feedback (“yeah”, “okay”) with under 6% false positives. CPU-only, no transcription required, ~15ms added latency.
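Krisp’s models are proprietary, but the backchannel-vs-interruption distinction is easy to illustrate. The sketch below is a hypothetical heuristic (lexicon plus overlap duration), not Krisp’s approach — their model works on audio without transcription — and all thresholds are made up for illustration.

```python
# Hypothetical sketch: separate backchannel feedback from true interruptions
# using a backchannel lexicon and overlap duration. Illustrative only.

BACKCHANNELS = {"yeah", "okay", "uh-huh", "right", "mm-hmm", "sure"}

def classify_overlap(words: list[str], overlap_ms: int) -> str:
    """Label user speech that overlaps the agent's turn."""
    tokens = [w.lower().strip(".,!?") for w in words]
    # Short utterances drawn entirely from the backchannel lexicon are
    # treated as feedback, not a request for the agent to stop speaking.
    if overlap_ms < 800 and all(t in BACKCHANNELS for t in tokens):
        return "backchannel"
    return "interruption"

print(classify_overlap(["yeah"], 300))          # backchannel
print(classify_overlap(["wait", "stop"], 400))  # interruption
```

A production system would classify on acoustic features (energy, prosody, duration) rather than transcripts, which is exactly why a text lexicon like this breaks down at real false-positive targets.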
Twilio launches a new Conversation Layer. At SIGNAL 2026, Twilio launched Conversation Memory, Orchestrator, and Intelligence — plus open-source Agent Connect for plugging any AI provider into Twilio voice and messaging channels. Memory builds a living, identity-resolved customer profile that persists across voice, SMS, WhatsApp, and chat.
📖 Reading
Vapi Voice Agent Playbook. 32 chapters distilled from 300M+ calls. Covers voice agent design, deployment, and scaling to production. Practical, opinionated, and grounded in real call data.
OpenAI WebRTC Infrastructure Playbook. OpenAI published its WebRTC architecture details: a split relay + transceiver design handling 900M+ weekly active users at 300–500ms latency. The relay layer is stateless; the transceiver service owns stateful ICE and DTLS sessions.
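The relay/transceiver split described above can be sketched in a few lines: a stateless relay routes each session to a transceiver as a pure function of the session ID, and only transceivers hold per-session state (standing in for ICE and DTLS sessions). This is a conceptual toy, not OpenAI’s code; class names and the hashing scheme are assumptions.

```python
# Hypothetical sketch of a stateless-relay / stateful-transceiver split.
import hashlib

class Transceiver:
    """Owns per-session state (stand-in for ICE/DTLS session state)."""
    def __init__(self, name: str):
        self.name, self.sessions = name, {}

    def handle(self, session_id: str, packet: bytes) -> str:
        state = self.sessions.setdefault(session_id, {"packets": 0})
        state["packets"] += 1
        return f"{self.name} handled packet {state['packets']} for {session_id}"

class StatelessRelay:
    """Keeps no session state; routing is a pure function of the session id,
    so any relay instance can forward any packet."""
    def __init__(self, transceivers: list[Transceiver]):
        self.transceivers = transceivers

    def route(self, session_id: str, packet: bytes) -> str:
        idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(self.transceivers)
        return self.transceivers[idx].handle(session_id, packet)

relay = StatelessRelay([Transceiver("t0"), Transceiver("t1")])
print(relay.route("call-123", b"rtp"))
print(relay.route("call-123", b"rtp"))  # same session -> same transceiver, state persists
```

The payoff of the split is that the relay tier scales horizontally with zero coordination, while only the transceiver tier needs sticky routing.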
OpenAI’s WebRTC Problem. A provocative response to the OpenAI post above, arguing WebRTC is the wrong protocol for voice AI: it aggressively drops audio packets to minimize latency, the opposite of what you want when a 200ms wait beats a dropped prompt.
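The post’s core tradeoff can be made concrete with a toy delivery policy: a WebRTC-style “realtime” policy drops packets that arrive past a deadline, while a “complete” policy waits so the model hears every word. This is an illustration of the argument, not code from the post; the deadline and delays are invented numbers.

```python
# Hypothetical sketch of the latency-vs-completeness tradeoff.

def deliver(packets: list[tuple[int, int]], policy: str, deadline_ms: int = 100) -> list[int]:
    """packets: list of (sequence_number, arrival_delay_ms)."""
    kept = []
    for seq, delay in packets:
        if policy == "realtime" and delay > deadline_ms:
            continue  # late packet dropped to keep playout latency low
        kept.append(seq)  # "complete" waits, trading latency for audio integrity
    return kept

stream = [(1, 20), (2, 250), (3, 30)]
print(deliver(stream, "realtime"))  # [1, 3]    -> a word in the prompt is lost
print(deliver(stream, "complete"))  # [1, 2, 3] -> ~200ms extra wait, nothing lost
```

For a human listener the dropped packet is a minor glitch; for an STT front end it can flip the meaning of the utterance, which is the post’s point.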
[YouTube] AssemblyAI CEO Dylan Fox on Skywatch. Dylan Fox discusses background noise handling, speaker identification, and what he calls the “intelligent listening layer” — understanding not just what was said but how and in what context. Useful for anyone thinking about the full audio intelligence stack.
TTS Models for Indian Languages: The Tech Giving Bharat a Voice. Developer survey covering Hindi, Tamil, Bengali, and Telugu TTS models with architecture comparisons and demo links. Good reference for anyone building voice AI for South Asian markets.
