Weekly Updates - Jun 8th 2026

Weekly Voice and Video AI Product and Platform news

Jun 08, 2026

🗞️ Market and Product News

AethexAI raises $3M to build voice AI for Africa and the Middle East. Ex-Goldman and Meta founders built Kora models (300M–1.7B params) for Arabic, French, and English dialects; already handling 17K calls/day for debt collection, KYC, and telecoms.
LOT Polish Airlines deploys ElevenLabs voice agents for customer service. Poland’s flag carrier becomes one of the first major European airlines to run ElevenLabs-powered voice agents across customer interactions.

🧰 Platform News

NVIDIA Nemotron 3.5 ASR transcribes 40 languages locally under 100ms. NVIDIA released a new multilingual speech-to-text model, Nemotron 3.5 ASR, aimed at voice agents. It ships as both a multilingual checkpoint (Nemotron 3.5 ASR) and an English-only checkpoint (Nemotron 3 ASR). It’s the lowest-latency STT model we’ve tested. It’s also completely open source, fine-tunable, and you can host it on your own infrastructure.
NVIDIA's Nemotron 3.5 ASR Streaming Multilingual available for Apple Silicon. The model is now available through FluidAudio, CoreML-optimized for Apple Silicon, so apps can run ~40-language real-time ASR entirely on device, with no cloud required.
LiveKit ships C++ SDK 1.0.0 for robotics and embedded systems. Native C++ client for realtime audio, video, and data tracks. Runs on Linux, macOS, Windows; ARM targets include NVIDIA Jetson, Raspberry Pi, and Rockchip. Hardware encoder acceleration included; ROS2 bridge on the roadmap.
HappyRobot launches its own TTS model. Built for low-latency deployment with accurate pronunciation of numbers, codes, and alphanumerics — the edge cases that break generic TTS in freight and logistics voice agents.
Microsoft launches MAI-Voice-2: zero-shot voice cloning across 17 languages. Clones a voice from 5–60s of audio with no retraining; wins 72% of head-to-head preference tests vs MAI-Voice-1; code-switches in Hindi-English and Spanish-English. Available in Foundry, VS Code, and Dynamics 365 Contact Center.
Miso Labs open-sources Miso One: 8B TTS at 110ms with human-level emotional prosody. An 8-billion-parameter open-source TTS model for highly expressive speech with 110ms latency; Miso Labs calls it “the most emotive voice model in the world”.
Rednote open-sources dots.tts: first fully continuous TTS pipeline (no codec). 2B-parameter end-to-end autoregressive TTS with no discrete tokens anywhere — continuous AudioVAE at 48kHz feeding a flow-matching acoustic head. 24 languages, zero-shot voice cloning, Apache 2.0.
Google Magenta RealTime 2: open-weights real-time music generation with text, audio, and MIDI. 230M and 2.4B model sizes; streams audio from text prompts or note input under 200ms. Apache 2.0; community PyTorch port with ZeroGPU demos appeared within hours of release.
Tencent RTC and Soniox partner for enterprise voice AI. Combines Soniox’s STT accuracy across 60+ languages with Tencent’s 3,200-node global network to deliver voice AI latency under 300ms in 200+ countries.

📖 Reading

Best TTS Providers 2026: Why Vendor Benchmarks Lie (Coval). Coval benchmarks ElevenLabs, Cartesia, OpenAI, Deepgram, and 10 others; finds latency is no longer the top differentiator — emotional control, multilingual depth, and cost now separate the leaders.
Best STT Providers 2026: Independent Benchmarks & How to Choose (Coval). Accuracy on clean English has plateaued; the 14-provider comparison finds a 30× price spread across the market and identifies end-of-turn detection speed as the new differentiator in voice agent workloads.
Barge-in and full-duplex: the architecture that makes voice agents feel human. Walks through the five-step interruption pipeline (VAD → intent classification → TTS cancel → LLM abort → re-listen) and explains why event-driven decoupled design is required to stay under 300ms.
Streaming speaker diarization: How to identify who’s speaking in real time (AssemblyAI). Explains how streaming diarization labels speakers in real time with low latency, and when to use it in live apps.
Building voice agent persistent memory with MongoDB Atlas Vector Search (LiveKit). LiveKit tutorial showing RAG + hybrid rankFusion recall for cross-session voice agent memory; the user profile loads via vector search before the first word of the conversation.
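
For readers who want a feel for the barge-in item above, here is a minimal sketch of the five-step interruption flow it lists (VAD → intent classification → TTS cancel → LLM abort → re-listen). This is not the article's implementation: the `BargeInController` class, the `is_interruption` helper, and the backchannel word list are all hypothetical names chosen for illustration, and the TTS and LLM work is stood in for by plain asyncio tasks.

```python
import asyncio

# Assumption: a toy list of acknowledgements that should NOT interrupt the agent.
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "ok", "right"}


def is_interruption(partial_transcript: str) -> bool:
    """Step 2 (toy intent classifier): short acknowledgements are not barge-ins."""
    return partial_transcript.strip().lower() not in BACKCHANNELS


async def _cancel(task):
    """Cancel a running task and wait until the cancellation has landed."""
    if task is not None and not task.done():
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass


class BargeInController:
    """Hypothetical controller; tts_task and llm_task stand in for real streams."""

    def __init__(self):
        self.state = "speaking"
        self.tts_task = None
        self.llm_task = None
        self.log = []

    async def on_vad_speech(self, partial_transcript: str) -> bool:
        # Step 1: VAD reported user speech.
        self.log.append("vad")
        # Step 2: decide whether this is a real interruption.
        if self.state != "speaking" or not is_interruption(partial_transcript):
            self.log.append("ignored")
            return False
        self.log.append("intent:interrupt")
        # Step 3: stop audio playback first so the user hears silence quickly.
        await _cancel(self.tts_task)
        self.log.append("tts_cancelled")
        # Step 4: abort the in-flight LLM generation.
        await _cancel(self.llm_task)
        self.log.append("llm_aborted")
        # Step 5: go back to listening.
        self.state = "listening"
        self.log.append("listening")
        return True


async def demo():
    ctl = BargeInController()
    # Stand-ins for long-running TTS playback and LLM streaming.
    ctl.tts_task = asyncio.create_task(asyncio.sleep(10))
    ctl.llm_task = asyncio.create_task(asyncio.sleep(10))
    await asyncio.sleep(0)  # let both tasks start
    ignored = await ctl.on_vad_speech("uh-huh")
    interrupted = await ctl.on_vad_speech("wait, stop")
    return ctl, ignored, interrupted
```

Running `asyncio.run(demo())` leaves the controller in the `listening` state with both stand-in tasks cancelled; the "uh-huh" event is ignored and the "wait, stop" event triggers the full sequence. A production system would presumably replace the word list with a real classifier and emit these steps as events between decoupled components, which is the design the article argues for.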

📦 Releases

LiveKit Agents: v1.5.17. Adds more model options for LiveKit inference, plus many fixes and improvements.
Pipecat: No releases.
TEN Framework: No releases.

RealTime AI - Weekly Updates
