One of the central questions in speech/audio evaluation is the overall quality of the generated output. Quality is not just naturalness or intelligibility; it is connected with many aspects, including all of those mentioned above. One of the most widely used quality evaluation methods for speech/audio is the mean opinion score (MOS). Its scale typically ranges from 1 (lowest quality) to 5 (highest quality, indistinguishable from a human) in steps of 1, a format known as a five-point Likert scale. Through Podonos, you can evaluate the overall quality of your speech/audio in a fully managed service.
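As a quick numerical illustration (this helper is not part of the Podonos API), a MOS is simply the arithmetic mean of the collected Likert ratings, often reported together with a confidence interval:

```python
import statistics

def mean_opinion_score(ratings):
    """Compute the MOS and a normal-approximation 95% confidence
    half-width from a list of 1-5 Likert ratings."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("Likert ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Standard error of the mean; 1.96 is the ~95% normal quantile.
    sem = statistics.stdev(ratings) / len(ratings) ** 0.5
    return mos, 1.96 * sem

ratings = [4, 5, 4, 3, 4, 5, 4, 4]  # eight hypothetical evaluators
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With more evaluators per file, the confidence interval shrinks, which is why services typically collect many ratings per stimulus.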
As one way of measuring quality, we demonstrate evaluating synthesized human voice with added noise.
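A sketch of the generation side of such a test (the function name here is illustrative, not part of the Podonos API): mix noise into the synthesized waveform at a controlled signal-to-noise ratio before submitting the files for evaluation.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db` (in dB), then add it sample-by-sample to `speech`."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Solve speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

# Illustrative stand-ins for real waveforms: a 440 Hz tone as "speech"
# and Gaussian white noise as the interferer, at a 16 kHz sample rate.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0, 1) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

The resulting samples would then be written out as audio files (for example with the standard `wave` module) and added to a Podonos evaluator via `etor.add_file(...)`, as in the P.808 example below.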
Another way of evaluating speech quality is to follow the ITU-T P.808 recommendation. It specifies 1) how to qualify the evaluators, 2) how to train them, and 3) how to collect and analyze the evaluation results. Setting up such a system and running the evaluation yourself is a demanding process. With Podonos, you can set up the whole evaluation with a few lines of code.
In this example, let’s assume you are developing a new speech enhancement algorithm, called MNSE (My New Speech Enhancement).
We will use mnse as the name of your package. Here is a code example that you can immediately execute.
```python
import podonos
from podonos import *
import mnse  # This is your speech enhancement package.

client = podonos.init()
etor = client.create_evaluator(
    name='mnse', desc='mnse_param1_param2', type='P808',
    lan='es-es', num_eval=15)

total_audio_files = 50
for i in range(total_audio_files):
    # Generate the enhanced audio file.
    enhanced_audio_path = mnse.enhance(f'/path/to/audio_{i}.wav')
    etor.add_file(File(path=enhanced_audio_path,
                       model_tag='MNSE',
                       tags=['bella', 'female', 'echo']))
etor.close()
```
In addition to selecting a proper group of evaluators, ITU-T P.808 requires additional steps to ensure that the evaluation environment is suitable and that evaluators conduct each session in an appropriate manner.
Following ITU-T P.808, we qualify the evaluators by reviewing their hearing device, mother tongue, age, gender distribution, hearing ability, and geography. Evaluators who are disqualified are stopped from continuing the evaluation session.
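Conceptually, this screening is a set of predicates over an evaluator's profile. A hypothetical sketch (the field names and criteria below are illustrative only, not the actual Podonos or P.808 rules):

```python
def qualifies(profile):
    """Illustrative qualification predicate; the real criteria follow
    ITU-T P.808 and are managed by the service."""
    checks = [
        # Playback through headphones or earbuds, not loudspeakers.
        profile.get("hearing_device") in ("headphones", "earbuds"),
        # Native speaker of the language under test.
        profile.get("native_language") == profile.get("test_language"),
        # Within the accepted age range.
        18 <= profile.get("age", 0) <= 65,
        # Passed a short hearing-capability check.
        profile.get("passed_hearing_test", False),
    ]
    return all(checks)

evaluator = {"hearing_device": "headphones",
             "native_language": "es-es",
             "test_language": "es-es",
             "age": 30,
             "passed_hearing_test": True}
print(qualifies(evaluator))  # True
```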
While your audio files are being evaluated, we automatically inject gold references (so-called anchors, or hidden questions) whose correct responses are known. If an evaluator responds to them incorrectly, that evaluator's results are automatically rejected afterwards.
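The mechanism can be sketched as follows (a simplified illustration, not the actual Podonos implementation): sessions whose answers to the hidden gold questions deviate too far from the known correct ratings are dropped.

```python
def filter_by_gold(sessions, gold_answers, tolerance=1):
    """Keep only sessions whose responses to the gold (hidden)
    questions are within `tolerance` of the known correct rating."""
    kept = []
    for session in sessions:
        ok = all(abs(session["responses"][q] - correct) <= tolerance
                 for q, correct in gold_answers.items())
        if ok:
            kept.append(session)
    return kept

# One clearly clean and one clearly degraded reference stimulus.
gold = {"gold_1": 5, "gold_2": 1}
sessions = [
    {"id": "a", "responses": {"gold_1": 5, "gold_2": 2}},  # consistent
    {"id": "b", "responses": {"gold_1": 2, "gold_2": 5}},  # inconsistent
]
print([s["id"] for s in filter_by_gold(sessions, gold)])  # ['a']
```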
Once the evaluation session is done, we automatically compute the overall reliability of each evaluation. Evaluations that are significantly less reliable than the others are flagged and excluded.