Preference Evaluation

This experiment measures listener preference between the baseline SpeechTokenizer (Model A) and our proposed neural audio codec (Model B).

Overall

Score: -0.66 (1 = SpeechTokenizer (Model A), -1 = Proposed Neural Audio Codec (Model B))

Answers

1. SpeechTokenizer (Model A) is more natural: 68 (6.80%)

(unlabeled): 206 (20.60%)

-1. Proposed Neural Audio Codec (Model B) is more natural: 726 (72.60%)

Deep analysis

1. SpeechTokenizer (Model A) is more natural

-1. Proposed Neural Audio Codec (Model B) is more natural

Sample (audio: SpeechTokenizer, Proposed Neural Audio Codec, Target Voice)	Statistics
326_221_to_p293_236.wav	-0.85 0.17 0.08
270_283_to_p293_219.wav	-0.40 0.32 0.15
280_393_to_p300_048.wav	-1.00 0.00 0.00
272_006_to_p303_144.wav	-0.90 0.14 0.07
300_125_to_p362_195.wav	0.50 0.39 0.18
277_010_to_p272_119.wav	-0.65 0.27 0.13
345_004_to_p284_022.wav	-0.90 0.14 0.07
311_366_to_p251_303.wav	-0.65 0.27 0.13
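
The three values in the Statistics column are not labeled. One reading that matches the rows above is: mean score, half-width of a 95% confidence interval, and standard error, each computed over 20 ratings per sample. The sketch below checks that reading against the first row. The vote vector, the count of 20 ratings, and the name `summarize` are assumptions made for illustration, not data from the study.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 19 degrees of freedom.
T_CRIT_DF19 = 2.093


def summarize(votes):
    """Return (mean, 95% CI half-width, standard error), rounded to 2 places.

    Assumes exactly 20 votes, each in {-1, 0, 1}, so T_CRIT_DF19 applies.
    """
    if len(votes) != 20:
        raise ValueError("T_CRIT_DF19 is only valid for 20 votes")
    mean = statistics.mean(votes)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    std_err = statistics.stdev(votes) / math.sqrt(len(votes))
    return round(mean, 2), round(T_CRIT_DF19 * std_err, 2), round(std_err, 2)


# Hypothetical votes: 17 listeners prefer Model B (-1), 3 give a neutral answer (0).
hypothetical_votes = [-1] * 17 + [0] * 3
print(summarize(hypothetical_votes))  # (-0.85, 0.17, 0.08)
```

Under the same assumption, a sample on which all 20 listeners prefer Model B gives a mean of -1 with zero spread, which agrees with the `-1.00 0.00 0.00` row.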

Overall

Score: -0.43 (1 = SpeechTokenizer (Model A), -1 = Proposed Neural Audio Codec (Model B))

Answers

1. SpeechTokenizer (Model A)'s voice is more similar to the target voice: 112 (11.20%)

(unlabeled): 344 (34.40%)

-1. Proposed Neural Audio Codec (Model B)'s voice is more similar to the target voice: 544 (54.40%)

Deep analysis

1. SpeechTokenizer (Model A)'s voice is more similar to the target voice

-1. Proposed Neural Audio Codec (Model B)'s voice is more similar to the target voice

Sample (audio: SpeechTokenizer, Proposed Neural Audio Codec, Target Voice)	Statistics
326_221_to_p293_236.wav	-0.70 0.22 0.11
270_283_to_p293_219.wav	-0.15 0.35 0.17
280_393_to_p300_048.wav	-0.90 0.14 0.07
272_006_to_p303_144.wav	-0.65 0.27 0.13
300_125_to_p362_195.wav	0.40 0.35 0.17
277_010_to_p272_119.wav	-0.20 0.39 0.19
345_004_to_p284_022.wav	-0.75 0.21 0.10
311_366_to_p251_303.wav	-0.55 0.24 0.11

Overall

Score: -0.62 (1 = SpeechTokenizer (Model A), -1 = Proposed Neural Audio Codec (Model B))

Answers

1. SpeechTokenizer (Model A) is more intelligible: 82 (8.20%)

(unlabeled): 218 (21.80%)

-1. Proposed Neural Audio Codec (Model B) is more intelligible: 700 (70.00%)

Deep analysis

1. SpeechTokenizer (Model A) is more intelligible

-1. Proposed Neural Audio Codec (Model B) is more intelligible

Sample (audio: SpeechTokenizer, Proposed Neural Audio Codec, Target Voice)	Statistics
326_221_to_p293_236.wav	-0.75 0.21 0.10
270_283_to_p293_219.wav	-0.50 0.32 0.15
280_393_to_p300_048.wav	-1.00 0.00 0.00
272_006_to_p303_144.wav	-0.85 0.17 0.08
300_125_to_p362_195.wav	0.50 0.36 0.17
277_010_to_p272_119.wav	-0.35 0.31 0.15
345_004_to_p284_022.wav	-0.70 0.27 0.13
311_366_to_p251_303.wav	-0.55 0.24 0.11
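
Each Overall score can be reproduced from the answer counts listed with it. The check below is a minimal sketch. It assumes that the three counts correspond, in order, to answers scored 1 (Model A), 0 (the unlabeled middle answer), and -1 (Model B), and that the overall score is the mean of those answer values. The criterion names and the function name `overall_score` are labels introduced here for illustration.

```python
def overall_score(n_model_a, n_neutral, n_model_b):
    """Mean answer when Model A scores 1, the middle answer 0, and Model B -1."""
    total = n_model_a + n_neutral + n_model_b
    return (n_model_a - n_model_b) / total


# Counts as reported for each criterion: (Model A, unlabeled middle answer, Model B).
reported_counts = {
    "naturalness": (68, 206, 726),
    "similarity": (112, 344, 544),
    "intelligibility": (82, 218, 700),
}

for criterion, counts in reported_counts.items():
    print(f"{criterion}: {overall_score(*counts):.2f}")
# naturalness: -0.66
# similarity: -0.43
# intelligibility: -0.62
```

Under these assumptions the computed values match the reported scores of -0.66, -0.43, and -0.62, which supports reading the middle count as a zero-valued answer.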
