
Two ways to design an evaluation

Use a Podonos template

Pick from a library of pre-built, science-backed evaluation templates. Question wording, scale, anchors, and instructions are already calibrated.

Bring your own design

Write a custom evaluation. Our team reviews and proposes edits before it goes live.

Podonos templates

We maintain a library of evaluation templates that map to common research and product questions. Each template carries:
  • Calibrated question wording. No ambiguous terms like “naturalness”; the wording is the version that produced the most consistent inter-evaluator agreement in our internal validation.
  • Scale + anchors. The right number of points for the question, with concrete example audio clips at each level.
  • Pre-set attention checks appropriate to the evaluation type.
  • Default evaluator count and assignment policy tuned to the typical confidence interval customers want.
Available templates include:

Naturalness (NMOS)

Five-point Likert mean opinion score with anchored examples.

Voice similarity (SMOS)

Compare a generated voice to a reference voice.

Speech quality (P.808)

ITU-T P.808 protocol for telecom-grade quality assessment.

Preferences (PREF)

Two-way A/B preference between models.

Ranking

N-way ranking with adaptive pairing to produce a global rank within a fixed evaluation budget.

Comparative similarity

Pairwise similarity to a reference under common-target conditions.
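
If you drive evaluations from code, selecting a template is typically a one-line choice at evaluator creation. The sketch below assumes the Podonos Python SDK and its init / create_evaluator / add_file calls; the evaluation name, file paths, and exact parameter names are illustrative, so confirm them against the SDK reference.

```python
# Minimal sketch: launch the NMOS (naturalness) template from Python.
# Assumes the Podonos Python SDK; names and paths below are hypothetical.
import podonos
from podonos import File

client = podonos.init(api_key="<YOUR_API_KEY>")

# type="NMOS" selects the naturalness template, which brings its calibrated
# wording, scale, anchors, attention checks, and evaluator defaults with it.
etor = client.create_evaluator(
    name="tts_v2_naturalness",
    desc="Naturalness check for the v2 TTS model",
    type="NMOS",
    lan="en-us",
)

# Attach the audio files to be rated, tagged by the model that produced them.
for i in range(10):
    etor.add_file(File(path=f"samples/utt_{i}.wav", model_tag="tts_v2"))

etor.close()  # submits the evaluation for launch
```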

Custom evaluation review

If a template does not fit your question, you can design your own — and we will review it before launch.

1. You draft

Write your question, scale, instructions, and anchors. Submit through the Workspace or your Slack channel.

2. We review

A Podonos evaluation specialist reads the draft for the failure modes we see most often: ambiguous wording, scale mismatch (too many points or too few), missing anchors, leading questions, and instructions that bury critical context.

3. We propose edits

You receive concrete proposed wording with the reasoning behind each change. You can accept, reject, or iterate.

4. Launch

Once you approve, the evaluation goes live with the same per-session quality controls as any template-based evaluation.
Most custom evaluations need at least one round of edits. The most common fix is replacing a vague target term (“natural”, “expressive”, “good quality”) with a behaviorally concrete prompt that anchors to a real-world reference.
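
To make that fix concrete, here is an illustrative before-and-after written as Python dicts. The structure is hypothetical shorthand for a draft, not an SDK or API schema; it only shows the shape of the rewrite review proposes.

```python
# Hypothetical draft shorthand, for illustration only.

# Before review: vague target term, no anchors.
draft = {
    "question": "How natural does this voice sound?",
    "scale": {"points": 5, "anchors": []},
}

# After review: a behaviorally concrete prompt anchored to a real-world
# reference, with one example clip per scale point.
reviewed = {
    "question": (
        "Imagine this voice reading a news bulletin on the radio. "
        "How likely would a listener be to notice it is synthetic?"
    ),
    "scale": {
        "points": 5,
        "anchors": [f"anchors/level_{p}.wav" for p in range(1, 6)],
    },
}
```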

What review catches

  • Vague target terms. Words like “natural,” “expressive,” “good,” or “high quality” mean different things to different evaluators. We replace them with concrete behavioral prompts.
  • Scale mismatch. Five-point Likert is right for many tasks but wrong for others: binary preferences should not use a 1–5 scale, and subtle quality gradations need more than 3 points (see the sketch after this list).
  • Missing anchors. Every scale point needs a concrete audio example. Without anchors, scores drift and inter-evaluator agreement collapses.
  • Leading questions. A question phrased to bias the answer (“how clearly does this voice articulate?” implies clarity is a feature) gets rewritten to neutral framing.
  • Buried context. Critical context placed at the end of a 500-word instruction page is invisible. We surface it.
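
The scale-mismatch case can be shown the same way. The fragment below reuses the hypothetical draft shorthand from above; it is not an SDK schema.

```python
# Scale mismatch, in the same hypothetical shorthand.

# Mismatch: a two-way preference question dressed in a 1-5 Likert scale.
before = {"type": "PREF", "scale": {"points": 5}}

# Fix: a forced binary choice, which is what a preference question measures.
after = {"type": "PREF", "choices": ["A", "B"]}
```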