Sesame AI recently open-sourced their Conversational Speech Model (CSM), but the release has sparked disappointment across the developer community. While the company previously showcased impressive interactive voice demos, many users are finding the released 1B parameter model significantly less capable than what was demonstrated.
A Stripped-Down Version of the Promised Technology
The open-sourced CSM is a speech generation model built on a Llama backbone with a smaller audio decoder that produces Mimi audio codes. While technically functional, community feedback indicates substantial limitations compared to Sesame's polished demo. Multiple commenters have described the release as a rug-pull, arguing that Sesame shipped a deliberately crippled version of its technology.
As one commenter put it: "Turns out it was a rug-pull. They open-sourced a crippled version of sesame (1B), not the one they're using in actual demo."
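To make the two-stage design concrete: a large backbone processes the sequence and predicts the first Mimi codebook, while a smaller decoder fills in the remaining codebooks per frame. The following is a toy sketch of that shape only, not the repo's actual classes; the codebook count and vocabulary size are illustrative assumptions, and the real 1B model is far larger.

```python
# Toy sketch of a CSM-like two-stage layout. All names and sizes here are
# illustrative assumptions, not the actual repo's implementation.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32      # assumed number of Mimi residual codebooks
CODEBOOK_SIZE = 2051    # assumed per-codebook vocabulary
D_BACKBONE, D_DECODER = 256, 128  # toy dimensions

class ToyCSM(nn.Module):
    def __init__(self):
        super().__init__()
        # "Llama backbone" stand-in: processes interleaved text/audio tokens
        layer = nn.TransformerEncoderLayer(D_BACKBONE, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.zeroth_head = nn.Linear(D_BACKBONE, CODEBOOK_SIZE)
        # smaller audio decoder: predicts the remaining codebooks per frame
        dec_layer = nn.TransformerEncoderLayer(D_DECODER, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.proj = nn.Linear(D_BACKBONE, D_DECODER)
        self.rest_heads = nn.Linear(D_DECODER, (NUM_CODEBOOKS - 1) * CODEBOOK_SIZE)

    def forward(self, embeddings):  # (batch, frames, D_BACKBONE)
        h = self.backbone(embeddings)
        zeroth = self.zeroth_head(h)                     # logits for codebook 0
        rest = self.rest_heads(self.decoder(self.proj(h)))
        rest = rest.view(*h.shape[:2], NUM_CODEBOOKS - 1, CODEBOOK_SIZE)
        return zeroth, rest

model = ToyCSM()
z, r = model(torch.randn(1, 10, D_BACKBONE))
print(z.shape, r.shape)  # torch.Size([1, 10, 2051]) torch.Size([1, 10, 31, 2051])
```

The split explains why the released decoder being "smaller" matters: most of the per-frame audio detail is produced by the cheaper second stage.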
The model requires a CUDA-compatible GPU and has been tested on CUDA 12.4 and 12.6, with Python 3.10 recommended. It can generate speech from text inputs and works best when provided with conversational context, but users report the quality and performance fall significantly short of expectations.
CSM Model Requirements
- CUDA-compatible GPU
- Tested on CUDA 12.4 and 12.6
- Python 3.10 recommended
- Access to Hugging Face models:
  - Llama-3.2-1B
  - CSM-1B
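Since the model works best with conversational context, prior turns have to be packaged and passed along with the text to synthesize. Here is a minimal, illustrative sketch of assembling such context; the `Segment` name and fields are assumptions for illustration, not the repo's actual API.

```python
# Illustrative only: the real repo defines its own context types and loader;
# the names and fields below are assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    speaker: int  # speaker ID, e.g. alternating 0/1 in a two-party chat
    text: str     # transcript of the turn
    audio: list = field(default_factory=list)  # stand-in for a waveform tensor

def build_context(turns: List[Tuple[int, str]], max_turns: int = 4) -> List[Segment]:
    """Keep only the most recent turns as generation context."""
    segments = [Segment(speaker=s, text=t) for s, t in turns]
    return segments[-max_turns:]

context = build_context([(0, "Hi there."), (1, "Hello!"), (0, "How are you?")])
# The generation call would then look roughly like (hypothetical signature):
# audio = generator.generate(text="I'm fine, thanks.", speaker=1, context=context)
```

Trimming to recent turns keeps the prompt short, which matters given the slow generation users report.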
Community Reported Issues
- Significantly slower than commercial alternatives
- Lower quality output than demonstrated in Sesame's demos
- Not a complete solution (speech generation only)
- Requires additional components to build a full voice assistant
- Some implementations experience awkward pauses in speech output
Performance and Usability Concerns
Users attempting to implement the model have encountered significant issues. The generation process is reportedly very slow, and the output quality has been described as suboptimal by community members who have tested it. One user specifically referenced a GitHub issue (#80) where these limitations are being discussed in detail.
Some developers have created alternative implementations to improve accessibility, such as a Python library for Mac users. However, even these implementations exhibit quirks, such as inserting awkward, multi-second pauses into the output.
Privacy and Practical Applications
Beyond performance issues, privacy concerns have emerged regarding Sesame's hosted solution. One user noted that Sesame's policy of recording and reviewing conversations makes their hosted service a complete non-starter, highlighting the potential value of a truly capable open-source alternative that could be self-hosted.
The community consensus appears to be that while open voice models represent an exciting opportunity to compete with proprietary solutions, this particular release fails to deliver on its promise. As one commenter noted, the gap between this base model and a polished, responsive voice assistant like those in Sesame's demos demonstrates that voice AI requires thinking in terms of complete systems rather than individual components.
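The "complete systems" point can be made concrete: a usable assistant chains capture, transcription, response generation, and synthesis in a loop, and latency in any one stage surfaces as the awkward pauses users report. A stub sketch of that data flow, with placeholder stages (all names hypothetical):

```python
# Hypothetical pipeline skeleton: each stage is a placeholder for a real
# component (an ASR model, an LLM, and a TTS model such as CSM).
from typing import Callable, List

def run_turn(audio_in: List[float],
             transcribe: Callable[[List[float]], str],
             respond: Callable[[str], str],
             synthesize: Callable[[str], List[float]]) -> List[float]:
    """One conversational turn: speech in -> speech out."""
    text = transcribe(audio_in)   # ASR stage
    reply = respond(text)         # LLM stage
    return synthesize(reply)      # TTS stage (e.g., CSM)

# Toy stand-ins to show the data flow end to end:
out = run_turn(
    [0.0, 0.1],
    transcribe=lambda a: "hello",
    respond=lambda t: t.upper(),
    synthesize=lambda t: [float(len(t))],
)
print(out)  # [5.0]
```

CSM fills only the final slot of this chain, which is why commenters call the release an incomplete solution rather than a voice assistant.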
The disappointment surrounding this release suggests there remains a significant opportunity for developers who can deliver a genuinely capable open-source voice model that matches the quality of proprietary alternatives. For now, the search continues for an open voice solution that truly delivers on the promise of natural, responsive voice interaction.