In my own preliminary testing, talking to this AI gave me goosebumps, so much so that I'd urge you to try it for yourself. Fair warning: the experience is far more striking than the factual tone of the article below can convey.

If you’ve ever tried to have a meaningful conversation with ChatGPT’s voice mode or winced at Alexa’s robotic responses, you know today’s AI assistants still feel like talking to a spreadsheet. That slightly-off quality—stilted timing, monotone delivery, or awkward pauses—keeps them stranded in the uncanny valley of voice. But a new contender from startup Sesame, founded by Oculus co-creator Brendan Iribe, claims its conversational model can finally bridge the gap.

This isn’t just a parlor trick. Sesame’s approach hinges on what it calls “voice presence”: the blend of timing, tone, and contextual awareness that makes conversations feel alive. Unlike ChatGPT’s voice mode, which often requires users to wait several seconds for responses, Sesame claims latency as low as 320 milliseconds—matching the natural gap between human speakers. The model also handles overlapping dialogue gracefully, allowing interruptions mid-sentence.

The technical alchemy behind the banter

Under the hood, Sesame’s Conversational Speech Model (CSM) ditches the traditional two-step process (text generation followed by speech synthesis) for an end-to-end system. It processes both text and audio tokens simultaneously using a hybrid transformer architecture. The open-sourced model leverages a split-RVQ tokenizer to capture semantic context and acoustic details in one pass, sidestepping delays from cascading models.
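Sesame's public description stops short of full implementation details, but the single-pass idea can be sketched roughly as follows. Everything in this sketch (the class name, layer counts, codebook sizes, and the way text and audio embeddings are combined) is my own illustrative assumption in PyTorch, not Sesame's actual CSM code:

```python
# Minimal, hypothetical sketch of a single-pass text+audio token model in the
# spirit of Sesame's description (end-to-end, text and audio processed together).
# Names, sizes, and the interleaving scheme are illustrative assumptions only.
import torch
import torch.nn as nn

class TinyConversationalSpeechModel(nn.Module):
    def __init__(self, text_vocab=32_000, audio_codebook=1024, n_codebooks=8, d_model=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        # One embedding table per RVQ codebook level; split-RVQ is approximated
        # here as independent codebooks summed into a single frame embedding.
        self.audio_embs = nn.ModuleList(
            nn.Embedding(audio_codebook, d_model) for _ in range(n_codebooks)
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        # Predict every codebook level of the next audio frame in one pass,
        # instead of handing finished text to a separate TTS stage.
        self.audio_heads = nn.ModuleList(
            nn.Linear(d_model, audio_codebook) for _ in range(n_codebooks)
        )

    def forward(self, text_ids, audio_codes):
        # text_ids:    (batch, text_len)       -- conversation text so far
        # audio_codes: (batch, audio_len, n_codebooks) -- audio history as RVQ codes
        text = self.text_emb(text_ids)
        audio = sum(emb(audio_codes[..., i]) for i, emb in enumerate(self.audio_embs))
        seq = torch.cat([text, audio], dim=1)   # text context followed by audio history
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.backbone(seq, mask=mask)
        last = hidden[:, -1]                    # state after the latest audio frame
        return torch.stack([head(last) for head in self.audio_heads], dim=1)

model = TinyConversationalSpeechModel()
logits = model(torch.randint(0, 32_000, (1, 12)), torch.randint(0, 1024, (1, 25, 8)))
print(logits.shape)  # (1, 8, 1024): next-frame logits for each codebook level
```

The point of the sketch is structural: one causal model sees the conversation's text and its own audio history together, so pronunciation and prosody decisions can condition on both without a hand-off between cascading models.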

Early benchmarks are promising. In tests comparing CSM to ElevenLabs and OpenAI’s voice models, Sesame’s system showed 80% accuracy on homograph disambiguation (e.g., pronouncing “lead” correctly in “metal lead” vs. “lead a team”)—a notorious pain point for text-to-speech engines. Subjective evaluations using the Expresso dataset found listeners couldn’t distinguish CSM-generated speech from human recordings when heard in isolation.
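The article doesn't spell out how that homograph figure is computed, but the scoring is straightforward to picture: generate audio for sentences containing ambiguous words, recover the pronunciation the model actually produced (for example, with a phoneme recognizer), and compare against the reference. The prompts, ARPAbet phoneme strings, and `recognized` values below are made up purely for illustration:

```python
# Hypothetical scoring of homograph disambiguation: compare the pronunciation
# recovered from the generated audio against the expected one per prompt.
cases = [
    {"prompt": "The pipe was made of metal lead.",     "expected": "L EH D", "recognized": "L EH D"},
    {"prompt": "She will lead the team tomorrow.",     "expected": "L IY D", "recognized": "L IY D"},
    {"prompt": "A single tear rolled down her cheek.", "expected": "T IH R", "recognized": "T EH R"},
]

correct = sum(c["recognized"] == c["expected"] for c in cases)
print(f"homograph accuracy: {correct / len(cases):.0%}")  # 67% on this toy set
```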

But context remains a hurdle. When evaluators judged audio clips with preceding conversational history, human samples still edged out CSM by a significant margin. Sesame’s team attributes this to the model’s current focus on prosody rather than narrative coherence. “It’s like an actor nailing a line reading but missing the scene’s emotional arc,” explains Ankit Kumar, Sesame’s CTO.

Sesame’s progress arrives as legacy voice assistants struggle to evolve. Amazon recently delayed its AI-powered Alexa overhaul due to hallucination risks, with engineers admitting even a 1% error rate could spell disaster at scale. Meanwhile, ChatGPT’s voice mode—while expressive—still can’t parse visual cues or maintain multi-turn conversations fluidly.

The startup’s secret sauce? Training on one million hours of diverse audio, emphasizing natural dialogue over scripted interactions. This dataset includes everything from podcast banter to customer service calls, allowing CSM to mimic the messy, interruptible flow of real-world chats.

Sesame plans to expand beyond English later this year, targeting 20+ languages by 2026. Future iterations aim to integrate visual inputs from the lightweight AI glasses the company is developing, enabling real-time analysis of facial expressions and environments. But the biggest test will be scaling while preserving what makes CSM unique: its knack for laughter, hesitation, and the unscripted rhythms that make conversations human.

For now, the demos speak for themselves. Sesame’s AI avoids the uncanny valley’s pitfalls not through perfection, but through imperfections—a well-timed “um,” a self-deprecating joke, the subtle rasp of someone trying not to laugh. It’s a reminder that connection isn’t about flawless execution. Sometimes, it’s about knowing when to giggle at a terrible pun.
