Audio length

Upper bound: 30 seconds

Keep individual audio files at or under 30 seconds. Longer audio causes evaluator fatigue, divided attention, and inconsistent ratings within the same clip.

Lower bound: 1 second

Keep audio at or above 1 second. Shorter clips do not give evaluators enough signal to form a reliable judgment.
Podonos applies a length surcharge for audio beyond 15 seconds (in 5-second increments) to discourage overly long clips. Up to 15 seconds is included at the base rate.
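The bounds and the surcharge rule above can be checked locally before upload. The following is a minimal sketch, not part of the Podonos SDK: the function names are hypothetical, the duration reader only handles WAV files through Python's standard `wave` module, and it assumes that a partial 5-second increment is billed as a full one, which the text above does not state.

```python
import math
import wave

# Bounds taken from the guidance above.
MIN_SECONDS = 1.0
MAX_SECONDS = 30.0
BASE_RATE_SECONDS = 15.0       # included at the base rate
SURCHARGE_STEP_SECONDS = 5.0   # surcharge is applied per 5-second increment


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def length_status(seconds: float) -> str:
    """Classify a clip against the 1-second and 30-second bounds."""
    if seconds < MIN_SECONDS:
        return "too_short"
    if seconds > MAX_SECONDS:
        return "too_long"
    return "ok"


def surcharge_increments(seconds: float) -> int:
    """Count 5-second surcharge increments beyond the first 15 seconds.

    Assumption: a partial increment is rounded up to a full one.
    """
    extra = max(0.0, seconds - BASE_RATE_SECONDS)
    return math.ceil(extra / SURCHARGE_STEP_SECONDS)
```

Under these assumptions, a 12-second clip has no surcharge increments, a 22-second clip has two, and a 45-second clip is flagged as `too_long` before any surcharge question arises.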

Why these bounds matter

  • Short clips (under ~1s) often do not contain enough phonemic context for an evaluator to form a stable opinion. Quality, naturalness, and similarity all become harder to measure reliably below this threshold.
  • Long clips (over ~30s) split an evaluator’s attention. Within a 60-second clip, for example, an evaluator’s impression of the first 10 seconds drifts before they commit to a single rating. The result is a noisier, less reproducible score.
If your use case genuinely requires longer audio (long-form narration, podcast generation, multi-turn dialog), reach out — we can structure the evaluation differently rather than scoring one long clip.

Rating instructions

The single biggest determinant of inter-evaluator agreement is the quality of the instructions. Two practices reliably improve agreement.

Anchor every scale point with an example

Do not ship a 1–5 scale with only “1 = bad, 5 = excellent” labels. Provide concrete audio anchors:
| Score | Label | Example anchor |
|-------|-------|----------------|
| 5 | Indistinguishable from human | A real human recording or your gold-standard reference |
| 4 | Very natural with minor artifacts | Your current production model on a clean script |
| 3 | Recognizably synthetic but acceptable | A mid-tier baseline render |
| 2 | Noticeable artifacts that distract from content | An older or under-trained model render |
| 1 | Clearly robotic or unintelligible | An old concatenative or low-quality vocoder render |
Each anchor should be a real audio clip the evaluator can listen to from the rating UI. See Bias Minimization → Anchored instructions.
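One way to keep the anchored scale honest is to store it as data and check that every scale point has a clip attached before the study launches. This is an illustrative sketch only: the `ScalePoint` type, the `scores_missing_anchor` helper, and the `anchors/score_N.wav` paths are all hypothetical and do not correspond to any Podonos API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScalePoint:
    score: int
    label: str
    anchor_clip: Optional[str]  # hypothetical path to the anchor audio


# Labels come from the anchored scale above; the file paths are placeholders.
NATURALNESS_SCALE = [
    ScalePoint(5, "Indistinguishable from human", "anchors/score_5.wav"),
    ScalePoint(4, "Very natural with minor artifacts", "anchors/score_4.wav"),
    ScalePoint(3, "Recognizably synthetic but acceptable", "anchors/score_3.wav"),
    ScalePoint(2, "Noticeable artifacts that distract from content", "anchors/score_2.wav"),
    ScalePoint(1, "Clearly robotic or unintelligible", "anchors/score_1.wav"),
]


def scores_missing_anchor(scale: List[ScalePoint]) -> List[int]:
    """Return the scores that have no anchor clip assigned, in ascending order."""
    return sorted(point.score for point in scale if not point.anchor_clip)
```

An empty result from `scores_missing_anchor(NATURALNESS_SCALE)` means every scale point has an anchor; any returned score identifies a label that evaluators would have to interpret without a reference clip.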

Use behaviorally concrete prompts

Replace abstract terms with concrete behavior the evaluator can listen for.
Avoid: “How natural is this voice?”
Prefer: “How much does this voice sound like it was spoken by a real human voice actor?”

Avoid: “Rate the quality of this audio.”
Prefer: “Rate the audio for the absence of distortion, clipping, and background noise.”

Avoid: “Are these voices similar?”
Prefer: “Do these two clips sound like the same speaker, regardless of what they are saying?”

Avoid: “Is the voice expressive?”
Prefer: “Does the voice convey the emotion described in the script (e.g., excitement, calm, urgency)?”

Vote count and budget

  • For naturalness and quality MOS studies, 10 votes per query is a good default for tight confidence intervals.
  • For preference (A/B) studies, 15–20 votes per query is recommended when the expected effect size is small.
  • For ranking, the Ranking evaluation type uses adaptive pairing — you set the budget and we minimize confidence intervals automatically.
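To see how vote count relates to interval width, the sketch below computes a mean opinion score and the half-width of its confidence interval from a list of votes. It is a rough illustration under a normal approximation, not a description of how Podonos computes intervals, and the function name is hypothetical. With around 10 votes, a Student's t critical value would give a somewhat wider and more accurate interval.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_interval(votes: Sequence[float], confidence: float = 0.95) -> Tuple[float, float]:
    """Return (mean, half_width) for a set of votes on one query.

    Assumption: normal approximation to the sampling distribution of the mean.
    """
    if len(votes) < 2:
        raise ValueError("at least two votes are needed to estimate spread")
    mean = statistics.fmean(votes)
    # Standard error of the mean, using the sample standard deviation.
    std_error = statistics.stdev(votes) / math.sqrt(len(votes))
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * std_error
```

Because the half-width shrinks with the square root of the vote count, halving the interval takes roughly four times as many votes, which is why small expected effects in A/B studies call for more votes per query.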

When in doubt, ask us

The cheapest mistake to fix is the one you fix before launching the evaluation. If you are unsure about question wording, audio length, vote count, or anchor selection, message us in your Slack channel before submitting. A short design review now will save a re-run later.