
Two ways to design an evaluation

Use a Podonos template

Pick from a library of pre-built, science-backed evaluation templates. Question wording, scale, anchors, and instructions are already calibrated.

Bring your own design

Write a custom evaluation. Our team reviews and proposes edits before it goes live.

Podonos templates

We maintain a library of evaluation templates that map to common research and product questions. Each template carries:
  • Calibrated question wording. No ambiguous terms like “naturalness”; the wording is the version that produced the most consistent inter-evaluator agreement in our internal validation.
  • Scale + anchors. The right number of points for the question, with concrete example audio clips at each level.
  • Pre-set attention checks appropriate to the evaluation type.
  • Default evaluator count and assignment policy tuned to the typical confidence interval customers want.
Available templates include:

Naturalness (NMOS)

Five-point Likert mean opinion score with anchored examples.

Voice similarity (SMOS)

Compare a generated voice to a reference voice.

Speech quality (P.808)

ITU-T P.808 protocol for telecom-grade quality assessment.

Preferences (PREF)

Two-way A/B preference between models.

Ranking

N-way ranking with adaptive pairing to produce a global rank within a fixed evaluation budget.

Comparative similarity

Pairwise similarity to a reference under common-target conditions.
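
If you drive evaluations from code, selecting a template is typically a one-line choice at evaluator creation. The sketch below assumes the Podonos Python SDK and its init / create_evaluator / add_file calls; the evaluation name, file paths, and exact parameter names are illustrative, so confirm them against the SDK reference.

```python
# Minimal sketch: launch the NMOS (naturalness) template from Python.
# Assumes the Podonos Python SDK; names and paths below are hypothetical.
import podonos
from podonos import File

client = podonos.init(api_key="<YOUR_API_KEY>")

# type="NMOS" selects the naturalness template, which brings its calibrated
# wording, scale, anchors, attention checks, and evaluator defaults with it.
etor = client.create_evaluator(
    name="tts_v2_naturalness",
    desc="Naturalness check for the v2 TTS model",
    type="NMOS",
    lan="en-us",
)

# Attach the audio files to be rated, tagged by the model that produced them.
for i in range(10):
    etor.add_file(File(path=f"samples/utt_{i}.wav", model_tag="tts_v2"))

etor.close()  # submits the evaluation for launch
```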

Custom evaluation review

If a template does not fit your question, you can design your own — and we will review it before launch.

1. You draft

Write your question, scale, instructions, and anchors. Submit through the Workspace or your Slack channel.

2. We review

A Podonos evaluation specialist reads the draft for the failure modes we see most often: ambiguous wording, scale mismatch (too many points or too few), missing anchors, leading questions, and instructions that bury critical context.

3. We propose edits

You receive concrete proposed wording with the reasoning behind each change. You can accept, reject, or iterate.

4. Launch

Once you approve, the evaluation goes live with the same per-session quality controls as any template-based evaluation.
Most custom evaluations need at least one round of edits. The most common fix is replacing a vague target term (“natural”, “expressive”, “good quality”) with a behaviorally concrete prompt that anchors to a real-world reference.
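
To make that fix concrete, here is an illustrative before-and-after written as Python dicts. The structure is hypothetical shorthand for a draft, not an SDK or API schema; it only shows the shape of the rewrite review proposes.

```python
# Hypothetical draft shorthand, for illustration only.

# Before review: vague target term, no anchors.
draft = {
    "question": "How natural does this voice sound?",
    "scale": {"points": 5, "anchors": []},
}

# After review: a behaviorally concrete prompt anchored to a real-world
# reference, with one example clip per scale point.
reviewed = {
    "question": (
        "Imagine this voice reading a news bulletin on the radio. "
        "How likely would a listener be to notice it is synthetic?"
    ),
    "scale": {
        "points": 5,
        "anchors": [f"anchors/level_{p}.wav" for p in range(1, 6)],
    },
}
```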

What review catches

  • Vague target terms. Words like “natural,” “expressive,” “good,” or “high quality” mean different things to different evaluators. We replace them with concrete behavioral prompts.
  • Scale mismatch. Five-point Likert is right for many tasks but wrong for others: binary preferences should not use a 1–5 scale, and subtle quality gradations need more than 3 points (see the sketch after this list).
  • Missing anchors. Every scale point needs a concrete audio example. Without anchors, scores drift and inter-evaluator agreement collapses.
  • Leading questions. A question phrased to bias the answer (“how clearly does this voice articulate?” implies clarity is a feature) gets rewritten to neutral framing.
  • Buried context. Critical context placed at the end of a 500-word instruction page is invisible. We surface it.
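
The scale-mismatch case can be shown the same way. The fragment below reuses the hypothetical draft shorthand from above; it is not an SDK schema.

```python
# Scale mismatch, in the same hypothetical shorthand.

# Mismatch: a two-way preference question dressed in a 1-5 Likert scale.
before = {"type": "PREF", "scale": {"points": 5}}

# Fix: a forced binary choice, which is what a preference question measures.
after = {"type": "PREF", "choices": ["A", "B"]}
```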