Cloning Voices Over Coffee: A First Look at Qwen3-TTS

You know that moment when you hear a voice on the radio, or a podcast, or even your own voicemail greeting, and think — I wish I could bottle that up? Well, Qwen3-TTS basically lets you do exactly that. With about three seconds of audio.

Yeah. Three seconds.

What Is Qwen3-TTS, Anyway?

Qwen3-TTS is an open-source text-to-speech model from the Qwen team that does something quietly remarkable: it takes a short clip of someone’s voice, builds a profile out of it, and then speaks as that person. Not in a creepy deepfake way — more in a “hey, I need a consistent narrator voice for my project and I happen to like this one” kind of way.

The model comes in a few flavors — the 1.7B Base model handles voice cloning, the CustomVoice variant offers built-in premium timbres with instruction control, and the VoiceDesign model lets you describe a voice from scratch using plain English. It supports 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian), which is a pretty generous spread.

Under the hood, it uses a novel 12Hz tokenizer that compresses speech into a compact, information-rich representation. That’s a fancy way of saying: it’s efficient, it’s fast, and it doesn’t lose the subtle stuff — the breathiness, the warmth, the little inflections that make a voice sound like a person and not a GPS navigator from 2009.
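
To get a feel for how compact that is, here is a back-of-the-envelope sketch. It assumes that "12Hz" means roughly 12 token frames per second and that the source audio is sampled at 24 kHz. Both numbers are assumptions for illustration, not values taken from the model's documentation, and each frame may carry more than one token.

```python
# Assumptions for illustration only (not taken from the Qwen3-TTS docs):
# about 12 tokenizer frames per second, and 24 kHz mono source audio.
FRAME_RATE_HZ = 12
SAMPLE_RATE_HZ = 24_000


def frames_for(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover a clip of this length."""
    return round(seconds * frame_rate_hz)


def samples_for(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples in a clip of this length."""
    return round(seconds * sample_rate_hz)


# The two clip lengths mentioned in this article: 3 s and 20 s.
for seconds in (3, 20):
    frames = frames_for(seconds)
    samples = samples_for(seconds)
    print(f"{seconds:>2} s clip: {samples} samples vs {frames} frames "
          f"({samples // frames} samples per frame)")
```

Under those assumptions, a three-second reference clip comes down to a few dozen frames, which helps explain why such a short sample is workable.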

The Demo: Voice Cloning in Two Clicks

While playing around with the model locally, I was surprised by how well it cloned my own voice from just 3 seconds of recorded audio, so I decided to build a demo.

I used 20 seconds of audio from the movie “Resident Evil: Extinction” (2007), specifically the character Alice, played by Milla Jovovich. I cut her intro monologue and left it unprocessed, with all the music and effects in the background, and here is Alice reading the intro of this article.

Making the Demo

For the live demo, I used the Qwen3-TTS repo, wrote a simple Python HTTP server with a few endpoints, and changed the defaults so it runs on CPU rather than GPU (I am hosting it on a VPS without a GPU).
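
A rough sketch of what a server like that might look like, using only the Python standard library. This is not the demo's actual code. The endpoint paths, the `DEVICE` constant, and the `synthesize` function are all hypothetical, and `synthesize` returns a short sine tone instead of calling the model, so the example runs without Qwen3-TTS installed.

```python
import io
import json
import math
import struct
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer

DEVICE = "cpu"  # hypothetical setting; a real server would pass this to the model loader


def synthesize(text: str, sample_rate: int = 16_000) -> bytes:
    """Hypothetical stand-in for the TTS call: returns a WAV tone whose length grows with the text."""
    n = int(sample_rate * min(0.05 * len(text), 2.0))
    frames = b"".join(
        struct.pack("<h", int(8000 * math.sin(2 * math.pi * 440 * i / sample_rate)))
        for i in range(n)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(frames)
    return buf.getvalue()


class TTSHandler(BaseHTTPRequestHandler):
    def _send(self, status: int, content_type: str, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        if self.path == "/health":
            body = json.dumps({"status": "ok", "device": DEVICE}).encode()
            self._send(200, "application/json", body)
        else:
            self._send(404, "text/plain", b"not found")

    def do_POST(self) -> None:
        if self.path != "/synthesize":
            self._send(404, "text/plain", b"not found")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            text = str(payload["text"])
        except (ValueError, KeyError, TypeError):
            self._send(400, "text/plain", b"expected JSON like {\"text\": \"...\"}")
            return
        self._send(200, "audio/wav", synthesize(text))


def make_server(host: str = "127.0.0.1", port: int = 8000) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), TTSHandler)
```

Calling `make_server().serve_forever()` starts it locally. Binding to `127.0.0.1` keeps the Python server private, which fits a setup where only the frontend proxy talks to it.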

I built a SvelteKit website that serves as the frontend and as a proxy between the user and the Python server, with session and local storage for persistence. Even though there is no login or signup, the SvelteKit app uses an HTTP-only cookie to persist data, so no user data is leaked to other users.
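
The cookie idea can be sketched in a few lines. The example below is written in Python to illustrate the concept and is not the SvelteKit code from the demo. The cookie name `session_id`, the in-memory `SESSIONS` store, and the `clips` field are all made up for the example.

```python
import secrets
from http.cookies import SimpleCookie
from typing import Optional, Tuple

# Hypothetical in-memory store: one private data bucket per anonymous visitor.
SESSIONS: dict = {}


def get_or_create_session(cookie_header: Optional[str]) -> Tuple[str, dict, Optional[str]]:
    """Return (session_id, session_data, set_cookie_value).

    set_cookie_value is None when the visitor already has a valid session,
    otherwise it is the value to send back in a Set-Cookie header.
    """
    cookie = SimpleCookie(cookie_header or "")
    morsel = cookie.get("session_id")
    if morsel is not None and morsel.value in SESSIONS:
        return morsel.value, SESSIONS[morsel.value], None

    # Unknown or missing cookie: mint a new unguessable id.
    session_id = secrets.token_urlsafe(32)
    SESSIONS[session_id] = {"clips": []}

    out = SimpleCookie()
    out["session_id"] = session_id
    out["session_id"]["httponly"] = True   # not readable from page JavaScript
    out["session_id"]["samesite"] = "Lax"
    out["session_id"]["path"] = "/"
    return session_id, SESSIONS[session_id], out["session_id"].OutputString()
```

Because every lookup goes through the visitor's own random id, one visitor's clips are never returned to another, even without accounts.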

I will write a full article on how I made the demo if there is interest in the technical details.

The Demo: Try It Out

The demo is live at https://qwen3tts-demo.rachidboudjelida.com/. Try it out. It’s on a small VPS right now, so expect slow audio generation if there are many users.