Documentation Index
Fetch the complete documentation index at: https://podonos.com/docs/llms.txt
Use this file to discover all available pages before exploring further.
Audio length
Upper bound: 30 seconds
Keep individual audio files at or under 30 seconds. Longer audio causes evaluator fatigue, divided attention, and inconsistent ratings within the same clip.
Lower bound: 1 second
Keep audio at or above 1 second. Shorter clips do not give evaluators enough signal to form a reliable judgment.
Podonos applies a length surcharge for audio beyond 15 seconds (in 5-second increments) to discourage overly long clips. Up to 15 seconds is included at the base rate.
Why these bounds matter
- Short clips (under ~1s) often do not contain enough phonemic context for an evaluator to form a stable opinion. Quality, naturalness, and similarity all degrade in measurability below this threshold.
- Long clips (over ~30s) split an evaluator’s attention. Within a 60-second clip, an evaluator’s impression of the first 10 seconds drifts by the time they commit to a single rating. The result is a noisier, less reproducible score.
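The bounds and surcharge rules above can be checked before upload. This is a minimal sketch (not part of the Podonos SDK; the function name and return shape are illustrative) that validates a clip's duration and reports how many 5-second surcharge increments apply beyond the 15 seconds included at the base rate:

```python
import math

MIN_SECONDS = 1.0
BASE_SECONDS = 15.0      # included at the base rate
MAX_SECONDS = 30.0
SURCHARGE_STEP = 5.0     # surcharge billed in 5-second increments

def check_clip(duration_s: float) -> dict:
    """Validate a clip's length against the documented bounds and
    count the 5-second surcharge increments beyond 15 seconds."""
    if duration_s < MIN_SECONDS:
        raise ValueError(f"clip is {duration_s:.2f}s; minimum is {MIN_SECONDS:.0f}s")
    if duration_s > MAX_SECONDS:
        raise ValueError(f"clip is {duration_s:.2f}s; maximum is {MAX_SECONDS:.0f}s")
    extra = max(0.0, duration_s - BASE_SECONDS)
    increments = math.ceil(extra / SURCHARGE_STEP)
    return {"duration_s": duration_s, "surcharge_increments": increments}

print(check_clip(12.0))   # within the base rate: 0 increments
print(check_clip(22.5))   # 7.5s over the base: 2 increments
```

A 22.5-second clip is 7.5 seconds over the base, which rounds up to two 5-second increments.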
Rating instructions
The single biggest determinant of inter-evaluator agreement is the quality of the instructions. Two practices reliably improve agreement.
Anchor every scale point with an example
Do not ship a 1–5 scale with only “1 = bad, 5 = excellent” labels. Provide concrete audio anchors:
| Score | Label | Example anchor |
|---|---|---|
| 5 | Indistinguishable from human | A real human recording or your gold-standard reference |
| 4 | Very natural with minor artifacts | Your current production model on a clean script |
| 3 | Recognizably synthetic but acceptable | A mid-tier baseline render |
| 2 | Noticeable artifacts that distract from content | An older or under-trained model render |
| 1 | Clearly robotic or unintelligible | An old concatenative or low-quality vocoder render |
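An anchored scale can also be expressed as data, which makes it easy to enforce that no scale point ships without an example. This sketch is illustrative (the file names and structure are hypothetical, not a Podonos API):

```python
# Hypothetical representation of an anchored 1–5 MOS scale:
# score -> (label, path to the example anchor clip).
ANCHORED_SCALE = {
    5: ("Indistinguishable from human", "gold_reference.wav"),
    4: ("Very natural with minor artifacts", "prod_model_clean.wav"),
    3: ("Recognizably synthetic but acceptable", "baseline_mid.wav"),
    2: ("Noticeable artifacts that distract from content", "old_model.wav"),
    1: ("Clearly robotic or unintelligible", "concat_vocoder.wav"),
}

def validate_scale(scale: dict) -> None:
    """Reject any 1–5 scale with missing, unlabeled, or unanchored points."""
    missing = [s for s in range(1, 6) if s not in scale]
    if missing:
        raise ValueError(f"unanchored scale points: {missing}")
    for score, (label, anchor) in scale.items():
        if not label or not anchor:
            raise ValueError(f"score {score} lacks a label or anchor clip")

validate_scale(ANCHORED_SCALE)  # passes: all five points are anchored
```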
Use behaviorally concrete prompts
Replace abstract terms with concrete behavior the evaluator can listen for.
Naturalness
Avoid: “How natural is this voice?”
Prefer: “How much does this voice sound like it was spoken by a real human voice actor?”
Quality
Avoid: “Rate the quality of this audio.”
Prefer: “Rate the audio for the absence of distortion, clipping, and background noise.”
Similarity
Avoid: “Are these voices similar?”
Prefer: “Do these two clips sound like the same speaker, regardless of what they are saying?”
Expressiveness
Avoid: “Is the voice expressive?”
Prefer: “Does the voice convey the emotion described in the script (e.g., excitement, calm, urgency)?”
Vote count and budget
- For naturalness and quality MOS studies, 10 votes per query is a good default for tight confidence intervals.
- For preference (A/B) studies, 15–20 votes per query is recommended when the expected effect size is small.
- For ranking, the Ranking evaluation type uses adaptive pairing — you set the budget and we minimize confidence intervals automatically.
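To see why 10 votes per query is a reasonable default for MOS studies, consider the standard normal-approximation confidence interval: the 95% CI half-width shrinks with the square root of the vote count. This sketch assumes votes are treated as i.i.d. samples and uses a per-query rating spread of ~0.8 MOS points, which is illustrative, not a Podonos-published figure:

```python
import math

def mos_ci_halfwidth(sample_std: float, n_votes: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width for a per-query MOS under a
    normal approximation: z * s / sqrt(n)."""
    return z * sample_std / math.sqrt(n_votes)

# With a rating spread of ~0.8 MOS points per query:
for n in (5, 10, 20):
    print(f"{n:2d} votes -> ±{mos_ci_halfwidth(0.8, n):.2f} MOS")
```

Going from 5 to 10 votes tightens the interval from roughly ±0.70 to ±0.50 MOS; doubling again to 20 only buys ±0.35, so the marginal return diminishes past the default.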