ElevenLabs Just Dropped v3 – Here’s What’s Actually New

ElevenLabs has launched v3 of their text-to-speech model in public alpha, and it’s already stirring up debate.

The company calls it their most expressive model to date, with support for over 70 languages, multi-speaker dialogue, and audio tags that let you shape tone and emotion mid-sentence.

It’s a big step forward on paper. But in practice? Things aren’t that simple.

Redditors testing the alpha version have plenty to say.

Some are blown away by the realism and fluidity. Others are frustrated by the loss of voice distinctiveness, limitations with pro voice cloning, and censorship filters that block swearing, even when used artistically.

And then there’s the pricing. It’s currently discounted by 80%, but the post-launch cost looks steep. Critics argue that open-source options like Dia from Nari Labs might catch up faster than expected, especially with ElevenLabs v3 essentially offering free training data during the alpha phase.

Still, there’s a reason creators keep trying it. When v3 gets it right, it really gets it right. The voices are dynamic, theatrical, and sometimes jaw-dropping. But with that power comes unpredictability.

Let’s break down what Eleven v3 brings to the table – and whether it’s worth the hype, the hassle, or the price tag.

What Makes Eleven v3 Different?

Eleven v3 introduces a few standout features that push the limits of AI-generated speech. This isn’t just about sounding human – it’s about sounding nuanced.

  • 70+ Language Support: You can switch between languages without retraining voices. That opens the door to global storytelling, multilingual characters, or broader accessibility.

  • Audio Tags: These include tags like [happy], [whispering], and [sighs] to change tone on the fly. You can guide the emotion of a sentence or line by inserting these tags directly into the script (see the sketch after this list).

  • Multi-Speaker Dialogue: It can simulate real conversations between multiple characters with interruptions, emotional shifts, and voice transitions. A single model handling this well is rare.
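To make the audio tags concrete, here’s a rough sketch of what a tag-laced script looks like when sent through ElevenLabs’ standard text-to-speech REST endpoint. The voice ID is a placeholder and the “eleven_v3” model ID is an assumption based on the public launch materials – double-check both against the current docs before copying anything.

```python
# Minimal sketch: sending a script with inline audio tags to the ElevenLabs TTS endpoint.
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: your own API key
VOICE_ID = "YOUR_VOICE_ID"    # placeholder: ideally one of the v3-recommended voices

# Audio tags are written inline, directly in the text you want spoken.
script = (
    "[happy] We actually pulled it off! "
    "[whispering] Don't tell anyone yet... "
    "[sighs] Fine. Tell them tomorrow."
)

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={"text": script, "model_id": "eleven_v3"},  # "eleven_v3" model ID is an assumption
    timeout=120,
)
resp.raise_for_status()

# The endpoint returns raw audio bytes (MP3 by default).
with open("tagged_line.mp3", "wb") as f:
    f.write(resp.content)
```

As the alpha reports below make clear, whether those tags are performed or simply read aloud still varies from voice to voice.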

That said, most users are learning that it takes a lot of trial and error to unlock that magic.

The audio tags don’t always work as expected. Some users report that the model simply reads them out loud instead of interpreting them, especially with non-recommended voices.

ElevenLabs admits this isn’t production-ready yet. It’s an alpha. And like most alphas, the early experience depends on your patience, your prompts, and your expectations.

Not Everyone’s Impressed

While many Redditors were quick to praise the model, others voiced strong concerns, especially around censorship, quality consistency, and pricing.

  • Censorship Filters: The most vocal criticism came from users frustrated by content restrictions. The model tends to block or misinterpret prompts with vulgarity or “unsafe” words. That’s a dealbreaker for some creative writers, especially those working on raw, emotional scripts.

    “This is how AI dies,” one user wrote. “How can it be used for artistic pursuits when every company is overly sensitive about ‘safety.’”

  • Voice Cloning Limitations: Others noticed a drop in voice distinctiveness. Legacy PVCs (Professional Voice Clones) didn’t translate well to v3. Only a handful of ElevenLabs-recommended voices performed consistently, and even those weren’t perfect across all styles.

  • Inconsistent Tag Performance: Multiple users complained that audio tags rarely worked. One commenter claimed only one in every six tries actually delivered the tagged effect.

  • Language Quality Gaps: Japanese, Swiss German, and other non-English languages were called out for poor realism. One user compared the Japanese output to a tourist trying to imitate native speech – “nothing like a real speaker.”

These aren’t minor bugs. They’re fundamental experience issues, especially for creators depending on believable delivery in every line.

Pricing Sparks New Debate

Eleven v3 is available at an 80% discount – for now. Based on the alpha rate, users estimate the full price will land at 1 credit per character when it officially launches – five times the alpha rate, and double what the faster Turbo and Flash models cost.

  • During Alpha: 1 credit per 5 characters

  • Projected Launch Price: 1 credit per character

Some users didn’t mind the temporary deal. But others felt burned by how fast costs could stack up, especially when generations fail and require retries. The model often demands multiple regenerations to nail tone, style, or emotional accuracy.
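To see how fast that stacks up, here’s a back-of-the-envelope estimate using the alpha and projected rates quoted above. The script length and retry count are purely illustrative, and it assumes each retry is billed like a fresh generation.

```python
# Back-of-the-envelope credit estimate using the rates quoted above.
# Alpha: 1 credit per 5 characters; projected launch: 1 credit per character.
ALPHA_CREDITS_PER_CHAR = 1 / 5
LAUNCH_CREDITS_PER_CHAR = 1.0

def estimate_credits(num_chars: int, retries: int, credits_per_char: float) -> float:
    """Assumes every retry is billed like a fresh generation, so retries multiply the bill."""
    return num_chars * (1 + retries) * credits_per_char

chars = 10_000   # roughly a short chapter of narration (illustrative)
retries = 2      # two do-overs per passage to nail the tone (illustrative)

print(f"Alpha pricing:  {estimate_credits(chars, retries, ALPHA_CREDITS_PER_CHAR):,.0f} credits")
print(f"Launch pricing: {estimate_credits(chars, retries, LAUNCH_CREDITS_PER_CHAR):,.0f} credits")
# -> Alpha pricing:  6,000 credits
# -> Launch pricing: 30,000 credits
```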

“Even 80% off is too much for us to be your guinea pigs,” one frustrated user wrote.

Others are already looking elsewhere. One user mentioned holding out for Dia by Nari Labs, an open-source project that promises similar capabilities, with more control and no pricing surprises. They noted that Eleven v3’s current output could be used as training data for future open models.

If ElevenLabs isn’t careful, this pricing shift could push even loyal users toward competitors that prioritize affordability and transparency.

Real-Time Use? Not Yet

Despite the model’s expressive power, ElevenLabs made it clear: v3 isn’t ready for real-time use. If you’re building a live app, using v3 right now will probably slow you down.

Instead, they recommend sticking with:

  • v2.5 Turbo

  • Flash model

These older models are faster, more stable, and better suited for production workflows. Meanwhile, a real-time version of v3 is in development, and a public API is “coming soon.” No fixed timeline has been announced.

This limitation matters. Developers creating dynamic voice features – AI companions, game narration, live dubbing – can’t afford lag or unpredictability. For now, v3 is mostly a tool for prototyping, content creation, and testing what’s possible.
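If you’re wiring this into an app today, the practical takeaway is to route latency-sensitive requests to the faster models and keep v3 for offline rendering. Here’s a minimal sketch of that split; the model ID strings (“eleven_flash_v2_5”, “eleven_turbo_v2_5”, “eleven_v3”) are assumptions based on ElevenLabs’ published model names, so verify them against your account’s model list before shipping.

```python
# Minimal sketch of routing by latency budget – not a definitive setup.
def pick_model(realtime: bool, expressive: bool = False) -> str:
    """Send live traffic to the fast models; reserve v3 for offline, expressive renders."""
    if realtime:
        # Flash targets the lowest latency; Turbo trades a little speed for quality.
        return "eleven_turbo_v2_5" if expressive else "eleven_flash_v2_5"
    # Offline narration, trailers, prototypes: v3's expressiveness is worth the wait.
    return "eleven_v3"

print(pick_model(realtime=True))                    # -> eleven_flash_v2_5
print(pick_model(realtime=False, expressive=True))  # -> eleven_v3
```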

Some Reddit users say they’ll wait for full Studio integration and official support for pro voices. Others worry that existing voices won’t carry over well at all, requiring complete retraining once the final version is ready.

Not All Voices Are Treated Equally

One of the recurring complaints in the thread was that voice cloning quality dropped with v3, especially with user-generated or legacy voices.

  • Some found that older voices made with sliders and “dumb luck” still sounded fine.

  • Others said their custom PVCs now came out warped or unnatural.

  • A few joked that the voices defaulted to sounding like “southern belles” or “sultry vixens” regardless of the input.

It seems ElevenLabs prioritized a small set of recommended voices during the alpha rollout. Voices not optimized for v3 may need retraining to work properly. But according to Discord reports, none of the current voices are trained on v3 yet. They just happen to perform better due to how the model interacts with them.

Training new voices on v3 will likely require serious compute power – something ElevenLabs probably won’t offer until the model is production-ready. That means users hoping to port over their legacy voices may have a long wait.

It’s Powerful, But Over-the-Top?

Even fans of v3 admitted the output can feel exaggerated – sometimes too dramatic for normal speech.

  • Some described it as sounding “over the top” or “unnaturally excited.”

  • One user joked that humans never speak to them with that much energy, so the tone felt fake.

Others disagreed. They pointed out that expressive TTS is supposed to sound dynamic: an over-the-top take can always be toned down, but lifeless delivery is hard to recover once it’s baked in.

There’s also the showcase vs. utility debate. v3 demos look amazing, but using it in real workflows still requires lots of adjustment, especially for emotional realism.

And while the [happy] or [sad] tags are nice in theory, several testers found that v3 often just reads them out loud, breaking immersion. When it works, it’s impressive. But when it doesn’t, it ruins the moment completely.

As one user summed it up: “For showcase purposes, it’s great. But for storytelling? It’s still unpredictable.”

Open Source Pressure Is Mounting

A growing number of users say they’re ready to jump ship – not because Eleven v3 is bad, but because open-source is catching up fast.

Projects like Dia by Nari Labs are mentioned more than once. They promise similar expressiveness, full control over content, and no artificial filters. All they need is quality training data – and ironically, many say they can now use Eleven’s public outputs for that.

There’s also a deeper trust issue. ElevenLabs has built impressive tools, but users are growing wary of increasing censorship, rising costs, and unclear timelines for feature parity.

As one commenter put it: “All we need is good training data, which we can get from this v3 model now. So yay, I guess.”

It’s a strange situation: the better ElevenLabs gets, the more they may be fueling their future competition.

Final Thoughts

Eleven v3 is exciting, flawed, and deeply polarizing.

You get flashes of brilliance – expressive emotion, multilingual control, multi-speaker dialogue – but it’s all buried under inconsistent behavior, unpredictable voice quality, and filters that limit creative freedom.

Still, for creators willing to experiment and push through its quirks, v3 opens the door to AI-generated speech that actually feels alive.

Whether that’s worth the cost or the compromises is up to you.
