Recommendations

Audio length

Upper bound: 30 seconds

Keep individual audio files at or under 30 seconds. Longer audio causes evaluator fatigue, divided attention, and inconsistent ratings within the same clip.

Lower bound: 1 second

Keep audio at or above 1 second. Shorter clips do not give evaluators enough signal to form a reliable judgment.

Podonos applies a length surcharge for audio beyond 15 seconds (in 5-second increments) to discourage overly long clips. Up to 15 seconds is included at the base rate.
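
The bounds and the surcharge rule above can be checked before upload with a few lines of arithmetic. The following is a minimal sketch, not Podonos code: the function name `check_clip` and the constant names are hypothetical, and it assumes that a partial 5-second increment beyond 15 seconds is counted as a full increment, which the text above does not specify.

```python
import math

# Values taken from the recommendations above.
MIN_SECONDS = 1.0        # lower bound
MAX_SECONDS = 30.0       # upper bound
INCLUDED_SECONDS = 15.0  # length covered by the base rate
INCREMENT_SECONDS = 5.0  # surcharge step size


def check_clip(duration_s: float) -> dict:
    """Report whether a clip is inside the recommended bounds and how many
    surcharge increments it would incur.

    Assumption: partial increments are rounded up.
    """
    within_bounds = MIN_SECONDS <= duration_s <= MAX_SECONDS
    extra = max(0.0, duration_s - INCLUDED_SECONDS)
    increments = math.ceil(extra / INCREMENT_SECONDS)
    return {"within_bounds": within_bounds, "surcharge_increments": increments}


if __name__ == "__main__":
    for seconds in (0.5, 12.0, 22.0, 45.0):
        print(seconds, check_clip(seconds))
```

Under that rounding assumption, a 22-second clip is within bounds and counts two increments, while a 12-second clip counts none.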

Why these bounds matter

Short clips (under ~1s) often do not contain enough phonemic context for an evaluator to form a stable opinion. Quality, naturalness, and similarity all degrade in measurability below this threshold.
Long clips (over ~30s) split an evaluator’s attention. Within a 60-second clip, an evaluator’s impression of the first 10 seconds drifts before they have to commit to a single rating. The result is a noisier, less reproducible score.

If your use case genuinely requires longer audio (long-form narration, podcast generation, multi-turn dialog), reach out. We can structure the evaluation differently rather than scoring one long clip.

Rating instructions

The single biggest determinant of inter-evaluator agreement is the quality of the instructions. Two practices reliably improve agreement.

Anchor every scale point with an example

Do not ship a 1–5 scale with only “1 = bad, 5 = excellent” labels. Provide concrete audio anchors:

Score	Label	Example anchor
5	Indistinguishable from human	A real human recording or your gold-standard reference
4	Very natural with minor artifacts	Your current production model on a clean script
3	Recognizably synthetic but acceptable	A mid-tier baseline render
2	Noticeable artifacts that distract from content	An older or under-trained model render
1	Clearly robotic or unintelligible	An old concatenative or low-quality vocoder render

Each anchor should be a real audio clip the evaluator can listen to from the rating UI. See Bias Minimization → Anchored instructions.
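
One way to keep the anchor set complete is to hold it as data and check that every scale point has a clip before launching. This is only an illustrative sketch: the `Anchor` class, the `missing_anchor_scores` helper, and the file paths are all hypothetical and are not part of any Podonos API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    score: int
    label: str
    clip: str  # hypothetical path to the anchor audio file


# Labels follow the table above; the clip paths are placeholders.
ANCHORS = [
    Anchor(5, "Indistinguishable from human", "anchors/score_5.wav"),
    Anchor(4, "Very natural with minor artifacts", "anchors/score_4.wav"),
    Anchor(3, "Recognizably synthetic but acceptable", "anchors/score_3.wav"),
    Anchor(2, "Noticeable artifacts that distract from content", "anchors/score_2.wav"),
    Anchor(1, "Clearly robotic or unintelligible", "anchors/score_1.wav"),
]


def missing_anchor_scores(anchors, scale=range(1, 6)):
    """Return the scale points that have no anchor clip assigned."""
    covered = {a.score for a in anchors if a.clip}
    return [s for s in scale if s not in covered]


if __name__ == "__main__":
    print(missing_anchor_scores(ANCHORS))
```

An empty list means every point on the 1–5 scale has an audio anchor.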

Use behaviorally concrete prompts

Replace abstract terms with concrete behavior the evaluator can listen for.

Naturalness

Avoid: “How natural is this voice?”
Prefer: “How much does this voice sound like it was spoken by a real human voice actor?”

Quality

Avoid: “Rate the quality of this audio.”
Prefer: “Rate the audio for the absence of distortion, clipping, and background noise.”

Similarity

Avoid: “Are these voices similar?”
Prefer: “Do these two clips sound like the same speaker, regardless of what they are saying?”

Expressiveness

Avoid: “Is the voice expressive?”
Prefer: “Does the voice convey the emotion described in the script (e.g., excitement, calm, urgency)?”

Vote count and budget

For naturalness and quality MOS studies, 10 votes per query is a good default for tight confidence intervals.
For preference (A/B) studies, 15–20 votes per query is recommended when the expected effect size is small.
For ranking, the Ranking evaluation type uses adaptive pairing: you set the budget and we minimize confidence intervals automatically.
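
To see how vote count affects the width of a confidence interval, the sketch below computes a mean opinion score and an approximate 95% interval half-width from a list of votes. It is an illustration under stated assumptions, not the method Podonos uses: the function name `mos_confidence_interval` and the sample votes are made up, and it uses a normal approximation, which is slightly optimistic for small samples (a Student's t critical value would give a somewhat wider interval).

```python
import math
from statistics import NormalDist, mean, stdev


def mos_confidence_interval(votes, confidence=0.95):
    """Return (mean, half_width) for a list of ratings.

    Uses a normal approximation: half_width = z * s / sqrt(n),
    where s is the sample standard deviation.
    """
    n = len(votes)
    if n < 2:
        raise ValueError("need at least two votes to estimate spread")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = stdev(votes) / math.sqrt(n)
    return mean(votes), z * standard_error


if __name__ == "__main__":
    ten_votes = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]  # made-up example ratings
    for votes in (ten_votes, ten_votes * 2):
        mos, half_width = mos_confidence_interval(votes)
        print(f"n={len(votes)}  MOS={mos:.2f}  +/- {half_width:.2f}")
```

Because the half-width scales with 1/sqrt(n), halving it requires roughly four times as many votes, which is why small expected effects call for larger per-query budgets.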

When in doubt, ask us

The cheapest mistake to fix is the one you fix before launching the evaluation. If you are unsure about question wording, audio length, vote count, or anchor selection, message us in your Slack channel before submitting. A short design review now will save a re-run later.
