# Evaluation Design & Review

> Science-backed templates and a human review pass on every evaluation.

## Two ways to design an evaluation

* Pick from a library of pre-built, science-backed evaluation templates. Question wording, scale, anchors, and instructions are already calibrated.
* Write a custom evaluation. Our team reviews it and proposes edits before it goes live.

## Podonos templates

We maintain a library of evaluation templates that map to common research and product questions. Each template carries:

* **Calibrated question wording.** No ambiguous "naturalness": each template uses the wording that has produced the most consistent inter-evaluator agreement in our internal validation.
* **Scale and anchors.** The right number of points for the question, with a concrete audio example at each level.
* **Pre-set attention checks** appropriate to the evaluation type.
* **Default evaluator count and assignment policy** tuned to the typical confidence interval customers want.

Available templates include:

* Five-point Likert mean opinion score with anchored examples.
* Compare a generated voice to a reference voice.
* ITU-T P.808 protocol for telecom-grade quality assessment.
* Two-way A/B preference between models.
* N-way ranking with adaptive pairing, for a global rank under a fixed budget.
* Pairwise similarity to a reference under common-target conditions.

## Custom evaluation review

If a template does not fit your question, you can design your own, and we will review it before launch.

* Write your question, scale, instructions, and anchors. Submit through the Workspace or your Slack channel.
* A Podonos evaluation specialist reads the draft for the failure modes we see most often: ambiguous wording, scale mismatch (too many points or too few), missing anchors, leading questions, and instructions that bury critical context.
* You receive concrete proposed wording with the reasoning behind each change. You can accept, reject, or iterate.
* Once you approve, the evaluation goes live with the same per-session quality controls as any template-based evaluation.

Most custom evaluations need at least one round of edits. The most common fix is replacing a vague target term ("natural", "expressive", "good quality") with a behaviorally concrete prompt that anchors to a real-world reference.

## What review catches

* **Ambiguous wording.** Words like "natural," "expressive," "good," or "high quality" mean different things to different evaluators. We replace them with concrete behavioral prompts.
* **Scale mismatch.** Five-point Likert is right for many tasks but wrong for others. Binary preferences should not have a 1–5 scale; subtle quality gradations need more than 3 points.
* **Missing anchors.** Every scale point needs a concrete audio example. Without anchors, scores drift and inter-evaluator agreement collapses.
* **Leading questions.** Some questions are phrased in a way that biases the answer: "how clearly does this voice articulate?" implies that clarity is already a feature of the voice. We rewrite to neutral framing.
* **Buried instructions.** Critical context placed at the end of a 500-word instruction page is invisible. We surface it.
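
Before submitting a custom draft, it can help to run a rough self-check for the mechanical failure modes listed above. The following is a minimal sketch, not the Podonos review process or any Podonos API: `EvaluationDraft`, `lint_draft`, the term list, and the thresholds are all hypothetical names and assumptions chosen for illustration. Leading questions are deliberately left out, since spotting them takes human judgment.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Vague target terms quoted in the text; illustrative, not exhaustive.
VAGUE_TERMS = ("natural", "expressive", "good", "high quality")

# Assumed threshold, taken from the "500-word instruction page" remark.
LONG_INSTRUCTION_WORDS = 500


@dataclass
class EvaluationDraft:
    """Hypothetical container for a custom evaluation draft."""
    question: str
    scale_points: int
    anchors: Dict[int, str] = field(default_factory=dict)  # scale point -> audio example
    instructions: str = ""
    binary_preference: bool = False  # True for A/B-style questions


def lint_draft(draft: EvaluationDraft) -> List[str]:
    """Return human-readable warnings for a draft (heuristic only)."""
    issues: List[str] = []

    # Ambiguous wording: flag vague terms at the start of a word,
    # so "natural" also catches "naturalness".
    question = draft.question.lower()
    for term in VAGUE_TERMS:
        if re.search(r"\b" + re.escape(term), question):
            issues.append(f"ambiguous wording: '{term}'")

    # Scale mismatch: binary preferences should not use a multi-point scale,
    # and 3 or fewer points may be too coarse for subtle gradations.
    if draft.binary_preference and draft.scale_points > 2:
        issues.append("scale mismatch: binary preference with a multi-point scale")
    elif not draft.binary_preference and draft.scale_points <= 3:
        issues.append("scale mismatch: scale may be too coarse for subtle gradations")

    # Missing anchors: every scale point needs a concrete audio example.
    missing = [p for p in range(1, draft.scale_points + 1) if not draft.anchors.get(p)]
    if missing:
        issues.append(f"missing anchors for scale points: {missing}")

    # Buried instructions: long pages risk hiding critical context.
    if len(draft.instructions.split()) >= LONG_INSTRUCTION_WORDS:
        issues.append("long instructions: move critical context to the top")

    return issues


draft = EvaluationDraft(
    question="How natural does this voice sound?",
    scale_points=5,
    anchors={1: "worst.wav", 5: "best.wav"},
)
for issue in lint_draft(draft):
    print(issue)
```

A clean result from a check like this only means the obvious patterns are absent; it does not replace the specialist review described above.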