# In-Session Quality Control

> Six layers of automated quality control that run during every Podonos evaluation.

## The six pillars

* **Acoustic environment detection.** Measure ambient noise. Auto-detect headphone vs. earphone vs. speaker. No self-report.
* **Fatigue management.** 45–60 min session cap. Mandatory mid-session break.
* **Minimum-listen requirement.** Every query's audio must play to completion before a rating can be submitted.
* **Attention tests.** Embedded throughout the session. Pattern of failures triggers automatic rejection.
* **Post-evaluation reliability loop.** Per-evaluator consistency check after the session. Unreliable evaluators are dropped and replaced.
* **Automatic audio review.** Files screened for playability, corruption, length mismatches, and silent or voiceless content.

## Acoustic environment detection

Most platforms ask "do you have headphones?" and "are you in a quiet room?" and trust the answer. We do not.

* **Ambient noise level.** We measure the background noise level through the device and reject sessions above a threshold. Quiet rooms pass; cafés do not.
* **Headphone vs. earphone vs. speaker detection.** Stereo and binaural cues are played back and the response pattern is analyzed. Speakers leak crosstalk in ways headphones do not, and we detect that automatically.
* **Continuous monitoring.** The check runs at session start and re-runs throughout. If conditions degrade mid-session, we flag it.

This is one of the most expensive defenses we run, and it eliminates a class of bias that destroys MOS-style evaluations: a listener on cheap speakers in a noisy room cannot reliably distinguish a high-fidelity render from a low-fidelity one.

## Fatigue management

Listening attentively is exhausting. Our internal experiments confirm what the literature reports: evaluator accuracy degrades sharply past about an hour of continuous evaluation work.

No evaluator works longer than 60 minutes in a single Podonos session.
Most sessions land between 45 and 60 minutes. If your evaluation is too large for one session, our assignment algorithm splits it into subsessions automatically, sized to the audio length and query count, and recruits more evaluators to cover the work.

Partway through a session, evaluators are required to take a short break. We play a calming nature video with ambient sound. They cannot skip it. This reliably restores attention for the second half.

An evaluator participates in exactly one session per evaluation. They cannot return later for "round two" and accumulate fatigue or memory effects.

## Minimum-listen requirement

The rating UI enforces a minimum-listen rule on every query: a query cannot be submitted until all of its audio has been played to completion. This is the most basic engagement defense in the platform. It eliminates click-through-without-listening at the mechanical level.

## Attention tests

Embedded throughout each session are attention checks designed to look like normal queries. They verify the evaluator is actually listening, has the headphones on, and is reading instructions before clicking.

* **Distribution.** Sprinkled throughout, not bunched at the start.
* **Threshold.** A single missed test is not a rejection, because the threshold is calibrated against the base rate of legitimate confusion. A pattern of misses triggers rejection.
* **Clean output.** Ratings from a rejected evaluator are stripped from the final aggregation.

## Post-evaluation reliability loop

When a session ends, the data is not yet final. Podonos runs a post-evaluation reliability loop before any number reaches your report.

### What we compute

For every evaluator in the cohort, we score:

* **Inter-evaluator agreement** on the same queries.
* **Consistency** on repeated and near-duplicate items embedded in the session.
* **Variance pattern** across the session. Sudden flips suggest a tired or distracted evaluator.

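
As a rough illustration of the first two signals above, the sketch below measures how far one evaluator sits from the leave-one-out mean of the others, and how much they disagree with themselves on repeated items. It is not Podonos's actual metric: the data shape, the function names `agreement_error` and `repeat_error`, and the choice of mean absolute gap are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical data shape: ratings[evaluator_id][query_id] = score on a 1-5 scale.
ratings = {
    "a": {"q1": 4, "q2": 2, "q3": 5},
    "b": {"q1": 4, "q2": 3, "q3": 5},
    "c": {"q1": 1, "q2": 5, "q3": 2},
}


def agreement_error(ratings, evaluator):
    """Mean absolute gap between one evaluator and the mean of everyone else.

    Lower means closer agreement with the rest of the cohort.
    """
    gaps = []
    for query, score in ratings[evaluator].items():
        # Leave-one-out: collect only the other evaluators' scores for this query.
        others = [
            r[query] for e, r in ratings.items() if e != evaluator and query in r
        ]
        if others:
            gaps.append(abs(score - mean(others)))
    return mean(gaps) if gaps else 0.0


def repeat_error(first, second):
    """Mean absolute gap between one evaluator's two ratings of repeated items."""
    shared = first.keys() & second.keys()
    return mean(abs(first[q] - second[q]) for q in shared) if shared else 0.0
```

In this toy cohort, evaluator `c` disagrees with the other two on every query, so `agreement_error(ratings, "c")` comes out larger than the value for `a` or `b`. A real pipeline would still need to decide how to weight the two signals and where to set the cut-off.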

These signals combine into a per-evaluator reliability coefficient. The cohort as a whole produces an aggregate reliability coefficient, which must clear a bar before your evaluation is finalized.

### The loop

1. Score every evaluator's reliability and the cohort-level reliability coefficient.
2. Remove all data from evaluators whose reliability falls below the threshold. Their votes never reach your final report.
3. Recruit fresh evaluators to replace the dropped slots so your votes-per-query target is restored.
4. Re-compute reliability with the new cohort. If the bar is not yet met, drop and backfill again.
5. The loop terminates only when the reliability bar is met. Final aggregated numbers are computed exclusively from evaluators who passed.

When the cohort reliability falls below the bar, we add 20% to 40% more evaluators in each backfill round and re-run the loop. The exact top-up depends on how far the cohort sits from the bar: a small gap needs a small top-up, a larger gap needs more.

Every vote in your final aggregation comes from an evaluator who passed the reliability gate, in a cohort whose aggregate reliability cleared the bar.

## Automatic audio review

Before evaluators ever see your files, every uploaded audio file runs through a pipeline of automated checks:

* **Playability.** Does the file decode cleanly? Are the headers valid? Is the codec supported by the playback device profile we ship?
* **Corruption.** We scan for truncated streams, header/payload mismatches, and zero-byte regions that indicate a broken upload or render.
* **Length mismatches.** We compare the duration declared in the file metadata against the actual decoded length. Mismatches frequently indicate truncation, codec issues, or generation failures upstream.
* **Silent or voiceless content.** We detect files that contain no audio at all, no speech (only background or instrumental), or unusually long silence. These are common failure modes for TTS pipelines.

Files that fail audio review are surfaced back to you in the Workspace before evaluation begins, so a broken render does not waste evaluator budget.
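
Two of the checks above, length mismatch and long silence, can be sketched on already-decoded samples as follows. This is a minimal illustration, not the Podonos pipeline: the function names, the 50 ms tolerance, the 20 ms frame size, and the RMS floor are hypothetical values chosen for the example.

```python
import math


def length_mismatch(declared_s, n_samples, sample_rate, tolerance_s=0.05):
    """True when declared and decoded durations differ by more than the tolerance.

    tolerance_s is an assumed value, not a documented Podonos threshold.
    """
    actual_s = n_samples / sample_rate
    return abs(declared_s - actual_s) > tolerance_s


def longest_silence_s(samples, sample_rate, frame_ms=20, rms_floor=0.01):
    """Seconds covered by the longest run of frames whose RMS is below rms_floor.

    samples are floats in [-1.0, 1.0]; frame_ms and rms_floor are assumed values.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    longest = current = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        # Extend the current silent run, or reset it when the frame is audible.
        current = current + len(frame) if rms < rms_floor else 0
        longest = max(longest, current)
    return longest / sample_rate
```

A caller could flag a file when `length_mismatch(...)` is true or when `longest_silence_s(...)` exceeds a limit suited to the content. Telling speech apart from background or instrumental audio needs a voice-activity detector, which this sketch does not attempt.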