
The six pillars

Acoustic environment

Measure ambient noise. Auto-detect headphone vs. earphone vs. speaker. No self-report.

Fatigue management

45–60 min session cap. Mandatory mid-session break.

Minimum-listen requirement

Every query’s audio must play to completion before a rating can be submitted.

Attention tests

Embedded throughout the session. Pattern of failures triggers automatic rejection.

Reliability scoring

Per-evaluator consistency check after the session. Unreliable evaluators are dropped and replaced.

Automatic audio review

Files screened for playability, corruption, length mismatches, and silent or voiceless content.

Acoustic environment detection

Most platforms ask “do you have headphones?” and “are you in a quiet room?” and trust the answer. We do not.
  • Ambient noise level. We measure the background noise level through the device and reject sessions above a threshold. Quiet rooms pass; cafés do not.
  • Headphone vs. earphone vs. speaker detection. Stereo and binaural cues are played back and the response pattern is analyzed. Speakers leak crosstalk in ways headphones do not — we detect that automatically.
  • Continuous monitoring. The check runs at session start and re-runs throughout. If conditions degrade mid-session, we flag it.
This is one of the most expensive defenses we run, and it eliminates a class of bias that destroys MOS-style evaluations: a listener on cheap speakers in a noisy room cannot reliably distinguish a high-fidelity render from a low-fidelity one.
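
To make the ambient-noise gate concrete, here is a minimal sketch of a level check on a short calibration clip, assuming mono float samples from the device microphone. The -50 dBFS cutoff and every name in it are illustrative assumptions, not Podonos's actual threshold or API.

```python
import numpy as np

AMBIENT_DBFS_LIMIT = -50.0  # hypothetical cutoff; quiet rooms sit well below this

def ambient_level_dbfs(samples: np.ndarray) -> float:
    """RMS level of a mono clip with samples in [-1, 1], expressed in dBFS."""
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return 20.0 * np.log10(max(rms, 1e-10))  # floor avoids log(0) on digital silence

def passes_ambient_check(calibration_clip: np.ndarray) -> bool:
    """Reject the session if the measured background level exceeds the limit."""
    return ambient_level_dbfs(calibration_clip) <= AMBIENT_DBFS_LIMIT

# Synthetic "quiet room" noise at roughly -60 dBFS passes; a louder room would not.
rng = np.random.default_rng(0)
quiet = rng.normal(0.0, 0.001, 48_000).astype(np.float32)
print(passes_ambient_check(quiet))  # True
```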

Fatigue management

Listening attentively is exhausting. Our internal experiments confirm what the literature reports: evaluator accuracy degrades sharply past about an hour of continuous evaluation work.

Hard cap per session

No evaluator works longer than 60 minutes in a single Podonos session. Most sessions land between 45 and 60 minutes.

Smart splitting

If your evaluation is too large for one session, our assignment algorithm splits it into subsessions automatically — sized to the audio length and query count — and recruits more evaluators to cover the work.
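
As a rough picture of how such sizing can work, the greedy sketch below packs queries into subsessions until a time budget is reached. The per-query rating overhead, the budget, and the function names are assumptions for illustration; the actual assignment algorithm is internal.

```python
RATING_OVERHEAD_S = 15.0    # assumed time an evaluator spends rating one query
SESSION_BUDGET_S = 45 * 60  # keep subsessions near the 45-minute end of the cap

def split_into_subsessions(durations_s: list[float]) -> list[list[int]]:
    """Greedily group query indices so each group fits the session budget."""
    sessions, current, used = [], [], 0.0
    for i, duration in enumerate(durations_s):
        cost = duration + RATING_OVERHEAD_S
        if current and used + cost > SESSION_BUDGET_S:
            sessions.append(current)       # this subsession is full; start a new one
            current, used = [], 0.0
        current.append(i)
        used += cost
    if current:
        sessions.append(current)
    return sessions

# 200 queries of ~30 s each split into several subsessions, each under the budget.
print(len(split_into_subsessions([30.0] * 200)))  # 4
```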

Mandatory mid-session break

Partway through a session, evaluators are required to take a short break. We play a calming nature video with ambient sound. They cannot skip it. This reliably restores attention for the second half.

One session per evaluator per evaluation

An evaluator participates in exactly one session per evaluation. They cannot return later for “round two” and accumulate fatigue or memory effects.

Minimum-listen requirement

The rating UI enforces a minimum-listen rule on every query: a query cannot be submitted until all of its audio has been played to completion. This is the most basic engagement defense in the platform — it eliminates click-through-without-listening at the mechanical level.
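
Mechanically, the rule reduces to a completeness check at submission time. A minimal sketch, assuming the player reports a completion event per audio file (all names hypothetical):

```python
def can_submit(query_stimuli: set[str], completed: set[str]) -> bool:
    """Submission unlocks only when every stimulus in the query has played to the end."""
    return query_stimuli <= completed

print(can_submit({"sys_a.wav", "sys_b.wav"}, {"sys_a.wav"}))               # False: sys_b.wav unplayed
print(can_submit({"sys_a.wav", "sys_b.wav"}, {"sys_a.wav", "sys_b.wav"}))  # True
```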

Attention tests

Embedded throughout each session are attention checks designed to look like normal queries. They verify that the evaluator is actually listening, is wearing headphones, and is reading instructions before clicking.
  • Distribution. Sprinkled throughout, not bunched at the start.
  • Threshold. A single missed test is not a rejection — the threshold is calibrated against the base rate of legitimate confusion, as sketched after this list. A pattern of misses is rejection.
  • Clean output. Ratings from a rejected evaluator are stripped from the final aggregation.
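
One way to picture that calibration is as a binomial tail test: choose the smallest miss count that an attentive listener, missing checks only at the base confusion rate, would almost never reach. The rates and budget below are illustrative assumptions, not the platform's numbers.

```python
from math import comb

BASE_CONFUSION_RATE = 0.05   # assumed chance an attentive listener misses one check
FALSE_REJECT_BUDGET = 0.01   # tolerated probability of rejecting an attentive listener

def miss_threshold(n_checks: int) -> int:
    """Smallest miss count whose tail probability, for an attentive listener,
    falls under the false-reject budget."""
    for k in range(n_checks + 1):
        tail = sum(
            comb(n_checks, j)
            * BASE_CONFUSION_RATE**j
            * (1 - BASE_CONFUSION_RATE) ** (n_checks - j)
            for j in range(k, n_checks + 1)
        )
        if tail <= FALSE_REJECT_BUDGET:
            return k
    return n_checks

print(miss_threshold(8))  # 3: one or two misses are tolerated, three is a pattern
```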

Post-evaluation reliability loop

When a session ends, the data is not yet final. Podonos runs a post-evaluation reliability loop before any number reaches your report.

What we compute

For every evaluator in the cohort, we score:
  • Inter-evaluator agreement on the same queries.
  • Consistency on repeated and near-duplicate items embedded in the session.
  • Variance pattern across the session — sudden flips suggest a tired or distracted evaluator.
These signals combine into a per-evaluator reliability coefficient, and the cohort itself produces an aggregate reliability bar that your evaluation must clear.
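
A sketch of how such signals might combine, under two assumptions: each signal arrives normalized to [0, 1], and the weights are a simple illustrative choice rather than the actual coefficient.

```python
WEIGHTS = {  # illustrative weights; the real combination is internal to Podonos
    "agreement": 0.5,             # inter-evaluator agreement on shared queries
    "duplicate_consistency": 0.3, # stability on repeated / near-duplicate items
    "variance_stability": 0.2,    # absence of sudden rating flips across the session
}

def reliability(signals: dict[str, float]) -> float:
    """Weighted combination of per-evaluator signals, each normalized to [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

def cohort_reliability(per_evaluator: list[float]) -> float:
    """One simple aggregate to compare against the bar: the cohort mean."""
    return sum(per_evaluator) / len(per_evaluator)

steady = reliability({"agreement": 0.9, "duplicate_consistency": 0.8, "variance_stability": 0.85})
erratic = reliability({"agreement": 0.4, "duplicate_consistency": 0.5, "variance_stability": 0.3})
print(steady, erratic, cohort_reliability([steady, erratic]))  # ~0.86 ~0.41 ~0.64
```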

The loop

1. Compute. Score every evaluator’s reliability and the cohort-level reliability coefficient.

2. Drop. Remove all data from evaluators whose reliability falls below the threshold. Their votes never reach your final report.

3. Backfill. Recruit fresh evaluators to replace the dropped slots so your votes-per-query target is restored.

4. Repeat. Re-compute reliability with the new cohort. If the bar is not yet met, drop and backfill again.

5. Release. The loop terminates only when the reliability bar is met. Final aggregated numbers are computed exclusively from evaluators who passed.
When the cohort reliability falls below the bar, we add 20% to 40% more evaluators in each backfill round and re-run the loop. The exact top-up depends on how far the cohort sits from the bar — a small gap needs a small top-up, a larger gap needs more. Every vote in your final aggregation comes from an evaluator who passed the reliability gate, in a cohort whose aggregate reliability cleared the bar.
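
The control flow can be sketched as follows, with `score` and `recruit` as hypothetical stand-ins for the platform's internals. Only the 20% to 40% top-up range comes from the text; the thresholds and the linear scaling of the top-up with the gap are illustrative assumptions consistent with the qualitative description above.

```python
import random

RELIABILITY_BAR = 0.75  # illustrative cohort bar
DROP_THRESHOLD = 0.60   # illustrative per-evaluator floor

def score(evaluator_id: int) -> float:
    return random.uniform(0.3, 1.0)       # stand-in for the real reliability scoring

def recruit(n: int, start: int) -> list[int]:
    return list(range(start, start + n))  # stand-in for recruiting n fresh evaluators

def reliability_loop(cohort: list[int]) -> list[int]:
    next_id = max(cohort) + 1
    while True:                                                    # repeat until the bar is met
        scores = {e: score(e) for e in cohort}                     # compute
        kept = [e for e in cohort if scores[e] >= DROP_THRESHOLD]  # drop
        aggregate = sum(scores[e] for e in kept) / len(kept) if kept else 0.0
        if kept and aggregate >= RELIABILITY_BAR:
            return kept                                            # release
        gap = RELIABILITY_BAR - aggregate
        top_up = max(1, round(len(cohort) * min(0.4, 0.2 + gap)))  # 20-40% top-up, scaled by the gap
        cohort = kept + recruit(top_up, next_id)                   # backfill
        next_id += top_up

random.seed(7)
print(len(reliability_loop(list(range(12)))))  # size of the surviving cohort
```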

Automatic audio review

Before evaluators ever see your files, every uploaded audio runs through a pipeline of automated checks:
  • Playability. Does the file decode cleanly? Are the headers valid? Is the codec supported by the playback device profile we ship?
  • Corruption. We scan for truncated streams, header/payload mismatches, and zero-byte regions that indicate a broken upload or render.
  • Length mismatch. We compare the duration declared in the file metadata against the actual decoded length. Mismatches frequently indicate truncation, codec issues, or generation failures upstream.
  • Silent or voiceless content. We detect files that contain no audio at all, no speech (only background or instrumental), or unusually long silence — common failure modes for TTS pipelines.
Files that fail audio review are surfaced back to you in the Workspace before evaluation begins, so a broken render does not waste evaluator budget.
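
As a concrete picture of two of these checks, the sketch below runs the length-mismatch and long-silence tests over already-decoded mono samples. All thresholds and names are illustrative assumptions; decoding, codec validation, and speech detection are outside its scope.

```python
import numpy as np

SILENCE_DBFS = -60.0        # a one-second window below this level counts as silent
MAX_SILENT_RUN_S = 3.0      # hypothetical limit on continuous silence
DURATION_TOLERANCE_S = 0.1  # allowed gap between declared and decoded length

def window_dbfs(x: np.ndarray) -> float:
    rms = float(np.sqrt(np.mean(np.square(x))))
    return 20.0 * np.log10(max(rms, 1e-10))

def review(samples: np.ndarray, sr: int, declared_s: float) -> list[str]:
    """Run the length-mismatch and silence checks on already-decoded mono audio."""
    issues = []
    decoded_s = len(samples) / sr
    if abs(decoded_s - declared_s) > DURATION_TOLERANCE_S:
        issues.append(f"length mismatch: declared {declared_s}s, decoded {decoded_s:.2f}s")
    silent_run_s = 0.0
    for start in range(0, len(samples), sr):  # scan in one-second windows
        if window_dbfs(samples[start:start + sr]) < SILENCE_DBFS:
            silent_run_s += 1.0
            if silent_run_s >= MAX_SILENT_RUN_S:
                issues.append("unusually long silence")
                break
        else:
            silent_run_s = 0.0
    return issues

# A 10 s clip whose back half is digital silence trips both checks.
sr = 16_000
rng = np.random.default_rng(1)
clip = np.concatenate([rng.normal(0, 0.1, 5 * sr), np.zeros(5 * sr)]).astype(np.float32)
print(review(clip, sr, declared_s=12.0))
```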