# Evaluation Design & Review

> Science-backed templates and a human review pass on every evaluation.

## Two ways to design an evaluation

<CardGroup cols={2}>
  <Card title="Use a Podonos template" icon="folder-open">
    Pick from a library of pre-built, science-backed evaluation templates. Question wording, scale, anchors, and instructions are already calibrated.
  </Card>

  <Card title="Bring your own design" icon="pen-to-square">
    Write a custom evaluation. Our team reviews and proposes edits before it goes live.
  </Card>
</CardGroup>

## Podonos templates

We maintain a library of evaluation templates that map to common research and product questions. Each template carries:

* **Calibrated question wording.** Instead of an ambiguous term such as "naturalness", each template uses the wording that produced the most consistent inter-evaluator agreement in our internal validation.
* **Scale + anchors.** The right number of points for the question, with a concrete audio example at each level.
* **Pre-set attention checks** appropriate to the evaluation type.
* **Default evaluator count and assignment policy** tuned to the confidence interval width customers typically want.
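
As a rough illustration of how evaluator count relates to confidence interval width, the sketch below estimates how many ratings per stimulus a target half-width requires. It assumes independent ratings and a normal approximation. The function name `ratings_needed`, the standard deviation of 1.0, and the target half-width are illustrative assumptions, not Podonos defaults.

```python
import math


def ratings_needed(rating_sd: float, half_width: float, z: float = 1.96) -> int:
    """Ratings per stimulus for a CI of +/- half_width around the mean score.

    Normal approximation: half_width = z * sd / sqrt(n), solved for n.
    z = 1.96 corresponds to a 95% interval.
    """
    if rating_sd <= 0 or half_width <= 0:
        raise ValueError("rating_sd and half_width must be positive")
    return math.ceil((z * rating_sd / half_width) ** 2)


# Hypothetical inputs: ratings spread of 1.0 scale point, target of +/- 0.2.
print(ratings_needed(rating_sd=1.0, half_width=0.2))  # -> 97
```

Under this approximation, halving the target half-width roughly quadruples the number of ratings required.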

Available templates include:

<CardGroup cols={2}>
  <Card title="Naturalness (NMOS)" icon="ear" href="/naturalness">
    Five-point Likert mean opinion score with anchored examples.
  </Card>

  <Card title="Voice similarity (SMOS)" icon="microphone-stand" href="/similarity">
    Compare a generated voice to a reference voice.
  </Card>

  <Card title="Speech quality (P.808)" icon="signal-bars-good" href="/quality">
    ITU-T P.808 protocol for telecom-grade quality assessment.
  </Card>

  <Card title="Preferences (PREF)" icon="people" href="/preferences">
    Two-way A/B preference between models.
  </Card>

  <Card title="Ranking" icon="ranking-star" href="/ranking">
    N-way ranking that uses adaptive pairing to produce a global rank within a fixed budget.
  </Card>

  <Card title="Comparative similarity" icon="microscope" href="/comparative-similarity">
    Pairwise similarity to a reference under common-target conditions.
  </Card>
</CardGroup>
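
For the five-point mean opinion score templates, the headline number is the mean rating, usually reported with a confidence interval. The following is a minimal sketch of that arithmetic, assuming ratings on a 1 to 5 scale and a normal-approximation 95% interval. `mos_with_ci` is a hypothetical helper for illustration and is not part of the Podonos SDK.

```python
import math
import statistics


def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a list of 1-5 ratings.

    Uses the sample standard deviation and a normal approximation,
    so the interval is only a rough guide for small samples.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width


# Made-up ratings for one stimulus, not real evaluation data.
mean, lower, upper = mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4])
print(f"MOS {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```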

## Custom evaluation review

If a template does not fit your question, you can design your own, and we will review it before launch.

<Steps>
  <Step title="You draft">
    Write your question, scale, instructions, and anchors. Submit through the Workspace or your Slack channel.
  </Step>

  <Step title="We review">
    A Podonos evaluation specialist reads the draft for the failure modes we see most often: ambiguous wording, scale mismatch (too many or too few points), missing anchors, leading questions, and instructions that bury critical context.
  </Step>

  <Step title="We propose edits">
    You receive concrete proposed wording with the reasoning behind each change. You can accept, reject, or iterate.
  </Step>

  <Step title="Launch">
    Once you approve, the evaluation goes live with the same per-session quality controls as any template-based evaluation.
  </Step>
</Steps>

<Tip>
  Most custom evaluations need at least one round of edits. The most common fix is replacing a vague target term ("natural", "expressive", "good quality") with a behaviorally concrete prompt that anchors to a real-world reference.
</Tip>

## What review catches

<AccordionGroup>
  <Accordion title="Ambiguous target terms" icon="circle-question">
    Words like "natural," "expressive," "good," or "high quality" mean different things to different evaluators. We replace them with concrete behavioral prompts.
  </Accordion>

  <Accordion title="Scale mismatch" icon="ruler-horizontal">
    Five-point Likert is right for many tasks but wrong for others. Binary preferences should not have a 1–5 scale; subtle quality gradations need more than 3 points.
  </Accordion>

  <Accordion title="Missing anchors" icon="anchor">
    Every scale point needs a concrete audio example. Without anchors, scores drift and inter-evaluator agreement collapses.
  </Accordion>

  <Accordion title="Leading questions" icon="arrow-right-to-bracket">
    Some questions are phrased in a way that biases the answer. For example, "How clearly does this voice articulate?" presupposes that the voice is clear. We rewrite these with neutral framing.
  </Accordion>

  <Accordion title="Buried instructions" icon="layer-group">
    Critical context placed at the end of a 500-word instruction page is invisible. We surface it.
  </Accordion>
</AccordionGroup>
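
Before submitting a draft, you could run a rough self-check for some of the issues above. The sketch below is a hypothetical pre-submission linter, not the Podonos review process: the `DraftEvaluation` fields, the task names, and the thresholds (200 words, final quarter of the text) are all assumptions made for illustration. Leading questions are left out because they need human judgment.

```python
import re
from dataclasses import dataclass, field

# Vague target terms named in the review notes above.
VAGUE_TERMS = ("natural", "expressive", "good", "high quality")


@dataclass
class DraftEvaluation:  # hypothetical structure, not a Podonos type
    question: str
    task: str  # assumed values: "binary_preference" or "graded_quality"
    scale_points: int
    anchors: dict = field(default_factory=dict)  # scale point -> audio example
    instructions: str = ""
    critical_phrases: tuple = ()  # context evaluators must not miss


def review_draft(draft: DraftEvaluation) -> list:
    """Return (code, detail) findings for a draft evaluation."""
    findings = []
    question = draft.question.lower()
    for term in VAGUE_TERMS:
        # Match at a word start so "naturalness" also counts as "natural".
        if re.search(r"\b" + re.escape(term), question):
            findings.append(("ambiguous_term", term))

    if draft.task == "binary_preference" and draft.scale_points != 2:
        findings.append(("scale_mismatch", "binary preference needs two options"))
    elif draft.task == "graded_quality" and draft.scale_points <= 3:
        findings.append(("scale_mismatch", "graded quality needs more than 3 points"))

    missing = [p for p in range(1, draft.scale_points + 1) if p not in draft.anchors]
    if missing:
        findings.append(("missing_anchors", f"no audio example for points {missing}"))

    text = draft.instructions.lower()
    if len(text.split()) >= 200:  # assumed threshold for "long" instructions
        for phrase in draft.critical_phrases:
            pos = text.find(phrase.lower())
            if pos >= 0 and pos / len(text) > 0.75:  # starts in the final quarter
                findings.append(("buried_instruction", phrase))
    return findings


draft = DraftEvaluation(
    question="How natural does this voice sound?",
    task="graded_quality",
    scale_points=3,
    anchors={1: "anchor_1.wav"},
)
for code, detail in review_draft(draft):
    print(code, "-", detail)
```

A check like this only catches surface patterns. It does not replace the specialist review described above.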
