I built an open-source, screen-free, storytelling AI toy for my nephew, who uses a Yoto player. My sister told me he sometimes talks to the stories, and I thought it would be cool if he could actually talk back to the characters, with AI models (STT, LLM, TTS) running locally on her MacBook instead of sending the conversation transcript to cloud models.
This is my voice AI stack:
- ESP32 on Arduino to interface with the Voice AI pipeline
- mlx-audio for STT (Whisper) and TTS with streaming (`qwen3-tts` / `chatterbox-turbo`)
- mlx-vlm to use vision-language models like Qwen3.5-9B and Mistral
- mlx-lm to use LLMs like Qwen3, Llama3.2, Gemma3
- Secure WebSockets to interface with a MacBook
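To give a feel for the ESP32-to-Mac link, here's a hedged, stdlib-only sketch of a tiny binary framing scheme for streaming audio over a WebSocket. The header layout and message types are illustrative assumptions, not the repo's actual wire format:

```python
import struct

# Hypothetical 3-byte frame header: 1-byte message type + 2-byte payload length.
# (Illustrative only -- the actual OpenToys wire format may differ.)
MSG_AUDIO = 0x01   # raw PCM chunk from the ESP32 mic
MSG_TTS = 0x02     # synthesized audio going back to the speaker
MSG_END = 0x03     # end-of-utterance marker

def pack_frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix a payload with its type and length so the peer can re-split the stream."""
    return struct.pack(">BH", msg_type, len(payload)) + payload

def unpack_frames(buf: bytes):
    """Yield (msg_type, payload) tuples from a buffer of concatenated frames."""
    offset = 0
    while offset + 3 <= len(buf):
        msg_type, length = struct.unpack_from(">BH", buf, offset)
        offset += 3
        yield msg_type, buf[offset:offset + length]
        offset += length
```

On the Arduino side the same three header bytes would just be written before each binary send; length-prefixed frames keep mic chunks and TTS chunks from blurring together on a single socket.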
This repo supports inference on Apple Silicon chips (M1/2/3/4/5), but I'm planning to add Windows support soon. Would love to hear your thoughts on the project.
Many parents are concerned about sending their children's chat transcripts to the cloud, and privacy is often the first thing that comes up when we talk about AI toys.
So I built OpenToys so that anyone with an ESP32 can create their own AI toys that run inference locally (starting with Apple Silicon) and keep their data from leaving their home network.
The repo currently supports voice cloning and multilingual conversations in 10 languages, all locally. The app is a Rust Tauri app with a Python sidecar that runs the voice pipeline. The stack uses Whisper for STT, any MLX LLM, and Qwen3-TTS or Chatterbox-Turbo for TTS.
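The STT-to-LLM-to-TTS hand-off described above can be sketched with pluggable callables, which is roughly why "any MLX LLM" can drop in. The stand-in lambdas below are toy stubs for mlx-audio (Whisper), mlx-lm, and the TTS models; the real pipeline streams, but the control flow is the same:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class VoicePipeline:
    # Each stage is just a callable, so any STT/LLM/TTS backend can be swapped in.
    stt: Callable[[bytes], str]            # audio in -> transcript
    llm: Callable[[str], str]              # transcript -> reply text
    tts: Callable[[str], Iterable[bytes]]  # reply text -> audio chunks

    def run(self, audio: bytes) -> list[bytes]:
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return list(self.tts(reply))  # chunks get streamed back to the toy

# Toy stand-ins so the sketch is self-contained (not the real model calls):
pipeline = VoicePipeline(
    stt=lambda audio: "tell me a story",
    llm=lambda text: f"Once upon a time... (you said: {text})",
    tts=lambda text: (word.encode() for word in text.split()),
)
```

Keeping the stages as separate nodes like this is also what makes it easy to test each model in isolation before wiring in the hardware.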
Yes, you can; I was just testing it. I made a "My Custom Voices" tab where you can record a small sample of your own voice, or upload a sample of whatever voice, and then use it. I'm also in the process of training a model of my voice to see how it handles it, using the 1.7B.
Works surprisingly well on a 4090. I'll also try it on a 5090. This is the best one I've seen so far, NGL. 11Labs is cooked lol.
I'd highly recommend Gemini 2.5 Pro too for its speech quality. It's priced lower and the quality is top-notch on their API. I made an implementation here in case you're interested: https://www.github.com/akdeb/ElatoAI, but it's on hardware, so maybe not totally relevant.
I'm using LiveKit, and I have indeed tested Gemini, but it appears to be broken, or at least incompatible with OpenAI. Not sure if this is a LiveKit issue or a Gemini issue. Anyway, I decided to go back to just using the LLM, STT, and TTS as separate nodes. I've also been looking into the Deepgram Voice Agent API, but LiveKit doesn't support it (yet?).
It's as if the rubber duck were actually on the desk while you're programming, and if we had an MCP server that could get live access to the code, it could give you real-time advice.
Wow, that's really cool, thanks for open-sourcing! I might dig into your MCP; I've been meaning to learn how to do that.
I genuinely think this could be great for toys that kids grow up with, i.e. the toy could adjust the way it talks depending on the kid's age and remember key moments in their life. Could be pretty magical for a kid.
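The age-adaptive idea could boil down to building the system prompt from the child's age plus a small memory store. None of this is from the project; it's just a hypothetical sketch of one way it might look:

```python
# Hypothetical sketch: adjust the toy's speaking style by age and surface
# remembered moments in the system prompt. Age bands and wording are made up.

def build_system_prompt(age: int, memories: list[str]) -> str:
    if age <= 5:
        style = "Use very short sentences and simple words."
    elif age <= 9:
        style = "Use playful language and ask curious questions."
    else:
        style = "Talk like a friendly older sibling."
    memory_lines = "\n".join(f"- {m}" for m in memories)
    return (
        f"You are a storytelling toy for a {age}-year-old. {style}\n"
        f"Things you remember about them:\n{memory_lines}"
    )
```

Since everything runs locally, the memory list could just be a file on the parent's machine that never leaves the home network.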