Natural ≠ Preferred: What Our TTS Rankings Revealed About How Humans Actually Judge AI Voices

When AI voice models are ranked through human evaluation, "naturalness" and "preference" diverge more than most developers expect. Podonos's US English benchmark ranked Deepgram Aura 2 #1 for naturalness and a different model, Rime Arcana v3, #1 for preference. Here's what that gap means.

There is a question almost every developer asks when evaluating a text-to-speech model: which one sounds most like a real human?

It is the right question to ask. But it is not the only question that matters. Ask separately which voice listeners actually prefer, and the rankings shift in ways that should change how teams select models.

At Podonos, we run large-scale pairwise human evaluations on major TTS APIs. In our latest US English benchmark (11 models, 478 audio groups, 3,500 head-to-head comparisons), we asked evaluators two separate questions:

Naturalness: "Between the two audios, which one sounds more like it was spoken by a real human voice actor?"
Preference: "Among the two audios, which one do you prefer in general?"

These feel like the same question. They are not.

Deepgram Aura 2 ranked #1 on naturalness. Rime Arcana v3 ranked #1 on preference.

Two different questions. Two different winners. The gap between those two rankings is one of the most underappreciated insights in voice AI evaluation today.

Two Questions, Two Different Winners
Why Do Naturalness and Preference Diverge?
The Psychology Behind the Gap
What Existing Benchmarks Miss
Why This Gap Changes How You Should Pick a TTS Model
Use Case Matters More Than Rankings
The Broader Problem with Single-Score Rankings
FAQ
Conclusion

Two Questions, Two Different Winners

When you ask someone "which voice sounds more like a real human?", they activate a particular mental model. They listen for imperfections: the subtle breath before a sentence, slight variation in vowel length, the soft glottal stop at the start of a word. Human speech has texture. It has micro-variations in pace, pitch, and energy that most TTS models still struggle to replicate convincingly.

Deepgram Aura 2 topped this dimension. Evaluators consistently picked it as the voice that sounded most like a real human voice actor, across 3,500 pairwise comparisons on US English audio.

When you ask "which voice do you prefer?", something different happens. Preference is not just about authenticity. It also captures pleasantness, clarity, pacing consistency, and the absence of anything distracting. A voice can sound very human and still be unpleasant to listen to. And a voice that does not quite nail human imperfection can still be the one people choose to hear again.

Rime Arcana v3 topped this dimension. The likely reason is not that sounding less human helped, but that evaluators found the overall listening experience more enjoyable, more comfortable, and more consistent.

This is not a measurement artifact. It reflects something real about how humans process and respond to synthetic speech, with direct implications for how teams evaluate and select TTS models.
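To make the two-question setup concrete, here is a minimal sketch of how pairwise votes could be tallied into one ranking per question. It is not the Podonos pipeline: the vote records, the model names (`model_x`, `model_y`, `model_z`), and the simple win-rate scoring are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical votes: (question, model_a, model_b, winner).
# Illustrative only; these are not real benchmark results.
votes = [
    ("naturalness", "model_x", "model_y", "model_x"),
    ("naturalness", "model_x", "model_z", "model_x"),
    ("naturalness", "model_y", "model_z", "model_z"),
    ("preference", "model_x", "model_y", "model_y"),
    ("preference", "model_x", "model_z", "model_x"),
    ("preference", "model_y", "model_z", "model_y"),
]

def rank_by_win_rate(votes):
    """Return {question: [(model, win_rate), ...]} sorted best-first."""
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    for question, a, b, winner in votes:
        # Both models in the pair played one comparison for this question.
        for model in (a, b):
            games[question][model] += 1
        wins[question][winner] += 1

    rankings = {}
    for question, played in games.items():
        rates = {m: wins[question][m] / n for m, n in played.items()}
        # Sort by win rate (descending), then by name for a stable order.
        rankings[question] = sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
    return rankings

rankings = rank_by_win_rate(votes)
for question, ranked in rankings.items():
    print(question, [model for model, _ in ranked])
```

Because each question is tallied separately, the same set of audio pairs can produce two different orderings, which is the pattern described above.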

Why Do Naturalness and Preference Diverge?

The short answer: because "natural" and "good" are not the same thing.

Human speech is full of characteristics that we recognize as human precisely because they are imperfect. Filler-adjacent pacing, irregular pitch, breaths in unexpected places, subtle nasalization at certain phonemes, slight variation in energy across a sentence. A TTS model that captures these characteristics accurately scores high on naturalness. Listeners identify it as sounding human.

But those same characteristics can irritate.

A voice that breathes audibly on every sentence may feel human, but it also feels intrusive in a narration context. A voice with expressive pitch variation sounds natural in conversation, but may feel dramatically overwrought when reading a straightforward product description. A voice with the micro-variations of natural speech may score high on naturalness for a short listening test, then score lower on preference over extended use simply because the irregularities compound over time.

Deepgram Aura 2's naturalness advantage likely reflects exactly this. It sounds human enough that evaluators clock it as the more authentic voice. But "more authentic" does not automatically translate into "more enjoyable to listen to." Rime Arcana v3's preference advantage suggests it hits a different target: a voice that is consistent, pleasant, and free of the qualities that generate friction in repeated listening.

The Psychology Behind the Gap

There is a well-documented phenomenon in human perception research called the uncanny valley, originally described in robotics but applicable to voice. As synthetic voices approach but do not quite reach human-level realism, some listeners experience subtle discomfort. The voice is almost human, but something is slightly off, and that slight wrongness registers more strongly than a clearly synthetic voice would.

A voice that ranks very high on naturalness occupies a zone where this effect can appear. It is close enough to human that listeners notice when it deviates from expectation. A voice that ranks slightly lower on naturalness, but is smooth, consistent, and confident, may produce stronger preference scores because it never triggers that near-miss reaction.

There is also a cognitive load dimension. Naturalness often correlates with variation: variable pacing, pitch shifts, expressive emphasis. Variation plausibly demands more cognitive processing. In short audio samples, it can be engaging. Over longer content, it can become fatiguing. Evaluators who rate a voice as more natural in a short pairwise test may later prefer a simpler, more predictable voice for extended listening.

A 2026 IEEE paper by Shirali-Shahreza and Penn, published in the IEEE Open Journal of Signal Processing, decomposes naturalness into eight distinct listener-defined dimensions: prosodic variation, fluency, intelligibility, articulatory clarity, emotional appropriateness, pacing, voice quality, and consistency. The finding is that "naturalness" is not a single construct. It is a composite, and different components of that composite pull preference in different directions.

Deepgram Aura 2 may excel on dimensions like prosodic variation and articulatory clarity. Rime Arcana v3 may excel on consistency and emotional appropriateness. The same composite label, "natural," can hide very different profiles.

What Existing Benchmarks Miss

The dominant public TTS ranking mechanism, the Artificial Analysis Speech Arena, uses blind pairwise comparisons with an Elo rating system. Evaluators choose which of two samples sounds "better," without specifying what "better" means. This produces a single ranking that collapses naturalness and preference (and expressiveness, and clarity) into one undifferentiated signal.

The result is a leaderboard that answers the question "which voice do people generally pick in a short pairwise test?", not "which voice is most natural?" or "which voice is most preferred for a given context?"

This matters because, at the time of writing, the top five models on the Artificial Analysis leaderboard sit within 50 Elo points of each other. At that level of quality convergence, the question of "which is best?" depends entirely on what you weight, and the Elo score gives you no way to understand that weighting.
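To get a feel for how small a 50-point gap is, the sketch below converts a rating gap into an expected head-to-head win rate. It assumes the textbook Elo logistic curve with a scale of 400; the arena's exact rating computation is not described here, so treat the numbers as a rough guide. The function name `elo_expected_score` is ours.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model under the standard Elo curve.

    Assumes the textbook formula E = 1 / (1 + 10^(-gap / scale)).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))

for gap in (0, 25, 50, 100):
    print(f"{gap:>3} Elo points -> expected win rate {elo_expected_score(gap):.3f}")
```

Under these assumptions, a 50-point gap means the higher-rated model wins roughly 57% of head-to-head votes. That is a modest edge, and it says nothing about which quality dimension produced it.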

Asking naturalness and preference as separate questions surfaces a dimension that single-score benchmarks cannot. When a model ranks #1 on naturalness and not on preference, as Deepgram Aura 2 does here, that tells a developer something specific: the voice is maximally convincing as a human voice, but there are qualities in it that reduce listener enjoyment relative to a top-ranked preference model.

When a model ranks #1 on preference and not on naturalness, as Rime Arcana v3 does here, that tells a different story: the voice may not be the most realistic, but it is consistently pleasant and produces less listener friction. Which of these profiles fits your deployment better depends entirely on your use case.
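One way to quantify how far two rankings diverge is a rank correlation. The sketch below computes Kendall's tau by hand for two orderings of the same models. The orderings and model names are hypothetical, and the function assumes no ties.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    """Kendall rank correlation between two orderings of the same items (no ties)."""
    pos_a = {model: i for i, model in enumerate(ranking_a)}
    pos_b = {model: i for i, model in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical orderings, best first. Only the top two models swap places.
naturalness_rank = ["model_a", "model_b", "model_c", "model_d"]
preference_rank = ["model_b", "model_a", "model_c", "model_d"]

tau = kendall_tau(naturalness_rank, preference_rank)
print(f"Kendall tau: {tau:.3f}")
```

A tau of 1.0 means the two questions produce identical orderings, and lower values mean more pairs of models swap places. Even a high tau can hide a swap at the very top, which is the position most teams care about.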

Why This Gap Changes How You Should Pick a TTS Model

Most TTS model selection decisions today follow the same logic: check the leaderboard, pick the top-ranked model, run a quick demo, ship it.

The naturalness-versus-preference gap exposes why this logic fails in practice.

If your use case is a voice agent or interactive telephony: Preference matters far more than naturalness. Users in a conversation with a voice agent care about responsiveness, clarity, and consistent pacing. A voice that sounds maximally human but has high intonation variability creates unpredictability that degrades the interaction. Rime Arcana v3's preference-first profile is a better fit for this context than a model optimized for naturalness alone.

If your use case is audiobook narration or long-form content: Both matter, but the balance shifts as listening duration increases. A naturalness score measured on a 10-second clip does not predict how listeners will respond after 45 minutes. Preference scores, which plausibly track reduced listener fatigue, are likely a better proxy for extended content performance.

If your use case is short-form content (social media, ads, UI audio): Naturalness plays a larger role. Listeners form a single, brief impression. A voice that sounds distinctively human registers faster and makes a stronger impact in a short window. Deepgram Aura 2's naturalness-first profile has a real advantage here.

If your use case is multilingual or accented speech: Neither the naturalness score nor the preference score from a US English evaluation transfers to other languages. Both metrics require separate evaluation in the target language, with evaluators who are native or near-native speakers.

The takeaway: the correct metric for your TTS selection depends on your use case, listener duration, and interaction model. No single score, whether Elo, naturalness MOS, or preference win rate, can answer that question for you.

Use Case Matters More Than Rankings

The naturalness-versus-preference gap is really a symptom of a larger problem: the assumption that TTS quality is a single, context-independent property.

It is not. A voice that is excellent for a 30-second product ad may be exhausting over a 3-hour audiobook. A voice that performs well in a quiet listening test may struggle in a noisy environment where users are listening on phone speakers. A voice that earns high preference scores from native English speakers may register differently with non-native listeners who process prosody differently.

This is why asking naturalness and preference as separate questions gives practitioners more actionable data than a single combined score. Two dimensions are still a simplification of the full picture, but they are a better simplification. They force you to think about what you are actually optimizing for, rather than delegating that decision to an aggregate rank.

For teams building production voice pipelines, the practical implication is that model selection should start with use-case definition, not with the leaderboard. Define your listener profile. Define your content format. Define your listening duration and environment. Then identify which metric is the better predictor of your actual outcome: naturalness, preference, or a combination. Run evaluations against that metric on your specific content, not on generic benchmark prompts.
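As a rough illustration of "naturalness, preference, or a combination," the sketch below blends the two win rates with use-case-specific weights and picks the highest-scoring model. Every number, model name, and weight is a hypothetical placeholder. A linear blend is only one possible way to combine the two dimensions, not a method taken from the benchmark.

```python
# Hypothetical per-model win rates; replace with results from your own evaluation.
win_rates = {
    "model_a": {"naturalness": 0.62, "preference": 0.54},
    "model_b": {"naturalness": 0.55, "preference": 0.61},
}

# Hypothetical weights expressing how much each dimension matters per use case.
use_case_weights = {
    "short_form_ad": {"naturalness": 0.7, "preference": 0.3},
    "voice_agent": {"naturalness": 0.3, "preference": 0.7},
}

def pick_model(use_case):
    """Return (best_model, scores) using a weighted sum of the two win rates."""
    weights = use_case_weights[use_case]
    scores = {
        model: sum(weights[dim] * rates[dim] for dim in weights)
        for model, rates in win_rates.items()
    }
    return max(scores, key=scores.get), scores

for use_case in use_case_weights:
    best, scores = pick_model(use_case)
    print(use_case, "->", best, {m: round(s, 3) for m, s in scores.items()})
```

With these placeholder numbers, the same two models trade places depending on the weights. The weights themselves are a product decision, so they should come from your use-case definition rather than from the leaderboard.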

The Broader Problem with Single-Score Rankings

The field's reliance on single-score TTS rankings reflects a broader challenge: evaluation is harder than generation.

Building a TTS model that produces convincing speech is now close to a solved problem at the top tier. The gap between the best and the fifth-best model on the Artificial Analysis leaderboard is smaller than the gap between a great voice and the right voice for a specific use case. The industry has optimized heavily for a narrow benchmark metric, while the more important question of what makes a voice effective in a real deployment remains underspecified.

Research is starting to catch up. The ICLR 2026 paper SpeechJudge, from teams at ByteDance Research and CUHK, proposes a comprehensive suite for speech naturalness evaluation that goes beyond single-dimension MOS, combining human preference datasets, evaluation benchmarks, and generative reward models. The direction is clear: the field is moving toward multi-dimensional evaluation because single-score benchmarks have reached the limit of their usefulness.

Our naturalness-versus-preference split is a practical step in the same direction. It does not require building an entirely new evaluation framework. It requires asking two questions instead of one, and accepting that the answers will sometimes be inconvenient.

When Deepgram Aura 2 and Rime Arcana v3 land at the top of two different rankings from the same evaluation dataset, that inconvenience is the point.

FAQ

Why did different models win naturalness and preference in Podonos's evaluation?

Deepgram Aura 2 ranked #1 on naturalness ("which sounds more like a real human voice actor?") while Rime Arcana v3 ranked #1 on preference ("which do you prefer in general?"). The two dimensions measure different things. Naturalness captures authenticity and human-like imperfection. Preference captures overall listening enjoyment, which is influenced by consistency, pleasantness, and the absence of friction. A maximally authentic voice is not always the most enjoyable one.

Which TTS evaluation metric should developers prioritize?

It depends on the use case. Naturalness scores better predict listener response in short-form, impression-driven contexts like ads or social content. Preference scores better predict listener response in extended-use contexts like voice agents, audiobooks, or e-learning. Evaluating both separately on your specific content type gives more accurate signal than any single leaderboard rank.

Is the Elo score from the Artificial Analysis Speech Arena a measure of naturalness or preference?

Neither exclusively. The arena asks evaluators to pick which sample sounds "better" without defining what "better" means. The resulting Elo score reflects an undefined mix of naturalness, pleasantness, clarity, and expressiveness. It is a useful general-purpose signal, but it cannot distinguish between the dimensions that matter for specific use cases.

How does Podonos measure naturalness and preference separately?

We use pairwise side-by-side comparisons, presenting each evaluator with two audio samples generated from the same text. For naturalness, the question is: "Which sounds more like a real human voice actor?" For preference: "Which do you prefer in general?" Each question produces an independent ranking. Our US English evaluation covers 11 models from providers including OpenAI, ElevenLabs, Google Cloud, AWS, Deepgram, Cartesia, Rime, ResembleAI, Typecast, Naver, and Respeecher, across 478 audio groups and 3,500 total comparisons.

Does a high naturalness score mean a voice is better for voice cloning?

Not directly. Voice cloning quality measures how closely the synthesized voice matches a specific reference speaker, which is a different dimension from general naturalness. A model can produce highly natural-sounding speech in general while struggling to accurately clone specific speaker characteristics, particularly for emotional range or accent.

Why does Deepgram Aura 2 score high on naturalness but not top preference?

This is a hypothesis, not a confirmed finding, but the most likely explanation is the uncanny valley effect combined with listener fatigue. Aura 2's strong naturalness performance means it captures micro-variations and imperfections that mark human speech. In short pairwise comparisons, those qualities are engaging. Over time, or in contexts requiring consistent clarity, those same qualities may create friction. Rime Arcana v3's preference advantage suggests a voice profile that is smoother and more consistent, which listeners find easier to remain comfortable with across repeated exposure.

Conclusion

The finding from our evaluation is simple but consequential: natural and preferred are different things, and treating them as the same is one of the most common mistakes in TTS model selection.

Deepgram Aura 2 sounds most like a human. Rime Arcana v3 is the voice people prefer. Both are facts about the same dataset, and they point in different directions.

Most public benchmarks produce a single number that collapses these two dimensions. That number is useful, but it is not sufficient. A developer who selects a model based on a single leaderboard score makes a use-case assumption they may not have examined: that the conditions of a short pairwise listening test match the conditions of their actual deployment.

The right evaluation question is not "which model ranks highest?" but "which model ranks highest on the dimension that predicts success in my specific deployment context?"

Starting from that question changes which model you pick, which evaluation data you collect, and which quality signals you monitor after launch.

You can explore the full ranking data, including both naturalness and preference scores across all 11 models, at podonos.com/Podonos/ranking-en-us.

Data from Podonos's US English TTS evaluation: 11 models, 478 audio groups, 3,500 pairwise comparisons. Naturalness MOS decomposition referenced from Shirali-Shahreza & Penn, IEEE Open Journal of Signal Processing, 2026. SpeechJudge reference from Zhang et al., ICLR 2026.
