The Zero Resource Speech Challenge

ZeroSpeech 2017

Results

The 2017 Challenge will appear as a special session at ASRU 2017 (December 16–20, 2017, Okinawa, Japan; see the ASRU 2017 webpage). The challenge's aims and metrics are an extension of those presented in [1]. Below are the baseline results for the hyper training set. The baselines and system results for the hyper test set will be revealed after the paper's acceptance.

Track 1

Baseline and topline

The baseline ABX error rates for Track 1 are given in Table 1. For the baseline model, we used 39-dimensional MFCC+Delta+Delta2 features computed every 10 ms, and the ABX score was computed using the frame-wise cosine distance averaged along the DTW path. The topline consists of posteriorgrams from a supervised phone-recognition Kaldi pipeline.
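For illustration, the distance used in this baseline can be sketched as follows. This is a simplified stand-in for the official evaluation code (the function names and the path-length bookkeeping are ours), assuming each utterance is a (frames × dims) feature matrix:

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two feature frames."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def dtw_avg_cosine(x, y):
    """DTW-align two (T, d) feature matrices and return the frame-wise
    cosine distance averaged along the best alignment path."""
    n, m = len(x), len(y)
    dist = np.array([[cosine_dist(x[i], y[j]) for j in range(m)] for i in range(n)])
    acc = np.full((n, m), np.inf)        # acc[i, j]: cost of best path ending at (i, j)
    steps = np.zeros((n, m), dtype=int)  # steps[i, j]: length of that path
    acc[0, 0], steps[0, 0] = dist[0, 0], 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((acc[i - 1, j], steps[i - 1, j]))
            if j > 0:
                cands.append((acc[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0:
                cands.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
            cost, length = min(cands)    # lowest cost; ties broken by shorter path
            acc[i, j] = cost + dist[i, j]
            steps[i, j] = length + 1
    return acc[-1, -1] / steps[-1, -1]
```

In an ABX comparison, A and X belong to the same phone category and B to a different one; X is classified correctly when `dtw_avg_cosine(a, x) < dtw_avg_cosine(b, x)`, and the error rate aggregates such comparisons.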

System comparisons

Columns give ABX error rates (%) on the development datasets (English, French, Mandarin) and the surprise datasets (LANG1, LANG2), each under the 1 s, 10 s, and 120 s conditions. "Superv." indicates whether the system uses supervision; "/" marks unranked entries.

| System (affiliation) | Eng 1s | Eng 10s | Eng 120s | Fra 1s | Fra 10s | Fra 120s | Man 1s | Man 10s | Man 120s | L1 1s | L1 10s | L1 120s | L2 1s | L2 10s | L2 120s | Superv. | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (MFCC) | 23.4 | 23.4 | 23.4 | 25.2 | 25.5 | 25.2 | 21.3 | 21.3 | 21.3 | 23.6 | 23.2 | 23.0 | 30.0 | 29.5 | 29.5 | No | |
| Topline (Kaldi posteriors, HMM-GMM) | 8.6 | 6.9 | 6.7 | 10.6 | 9.1 | 8.9 | 12.0 | 5.7 | 5.1 | 12.8 | 10.5 | 10.4 | 7.1 | 3.6 | 4.3 | Yes | |
| Chen et al. (Northwestern Polytechnical University) | 13.7 | 12.1 | 12.0 | 17.6 | 15.6 | 14.8 | 12.3 | 10.8 | 10.7 | 15.5 | 12.9 | 12.7 | 17.6 | 16.9 | 16.3 | No | 4 |
| Räsänen et al. (Aalto University) | N/A | 15.0 | N/A | N/A | 17.8 | N/A | N/A | 13.5 | N/A | N/A | 15.6 | N/A | N/A | 20.1 | N/A | No | / |
| Yuan et al. #1 (Northwestern Polytechnical University) | 14.2 | 12.1 | 11.8 | 18.9 | 15.8 | 15.2 | 12.8 | 11.1 | 10.9 | 16.4 | 13.3 | 13.0 | 19.2 | 17.3 | 16.7 | No | 9 |
| Yuan et al. #2 (Northwestern Polytechnical University) | 14.0 | 11.9 | 11.7 | 18.6 | 15.5 | 14.9 | 12.7 | 10.8 | 10.7 | 16.2 | 12.9 | 12.6 | 19.5 | 17.1 | 16.6 | No | 7 |
| Yuan et al. #3 (Northwestern Polytechnical University) | 13.6 | 11.5 | 11.3 | 17.7 | 14.8 | 14.4 | 12.9 | 10.7 | 10.5 | 15.8 | 12.4 | 12.3 | 18.7 | 17.4 | 17.0 | Yes | / |
| Shibata et al. (Tokyo Institute of Technology) | 10.1 | 9.2 | 8.2 | 13.7 | 12.4 | 10.8 | 10.4 | 9.5 | 8.0 | 11.6 | 9.9 | 8.7 | 11.5 | 10.2 | 8.6 | Yes | / |
| Pellegrini et al. #1 (IRIT Toulouse) | 17.6 | 16.3 | 16.4 | 20.3 | 17.6 | 17.3 | 14.7 | 13.5 | 13.4 | 19.4 | 16.2 | 15.9 | 22.8 | 23.1 | 23.1 | No | 11 |
| Pellegrini et al. #2 (IRIT Toulouse) | 17.6 | 16.2 | 16.3 | 20.1 | 17.7 | 17.3 | 14.7 | 13.5 | 13.4 | 19.2 | 16.3 | 16.0 | 23.3 | 23.3 | 23.1 | No | 10 |
| Heck et al. (NAIST) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 12.0 | 10.0 | 9.7 | 12.7 | 10.0 | 9.9 | No | / |
| Heck et al. (NAIST) | 10.1 | 8.7 | 8.5 | 13.6 | 11.7 | 11.3 | 8.8 | 7.4 | 7.3 | 11.9 | 10.0 | 9.7 | 13.0 | 10.0 | 9.9 | No | 1 |
| Chen et al. #2 (Northwestern Polytechnical University) | 12.7 | 11.0 | 10.8 | 17.0 | 14.5 | 14.1 | 11.9 | 10.3 | 10.1 | 14.7 | 11.7 | 11.6 | 16.9 | 14.7 | 14.4 | No | 2 |
| Shibata et al. (Tokyo Institute of Technology) | 7.9 | 7.4 | 6.9 | 11.2 | 10.8 | 9.8 | 7.8 | 7.5 | 6.7 | 9.3 | 8.6 | 7.8 | 8.3 | 7.9 | 7.2 | Yes | / |
| Ansari TK et al. #1 (Indian Institute of Science) | 14.5 | N/A | 13.2 | 17.8 | N/A | 16.2 | 13.2 | N/A | 12.7 | 16.9 | 14.7 | 14.7 | 18.8 | 17.7 | 17.7 | No | 8 |
| Ansari TK et al. #2 (Indian Institute of Science) | 13.7 | N/A | 12.4 | 17.2 | N/A | 15.6 | 12.6 | N/A | 12.0 | 16.0 | 14.0 | 13.9 | 17.9 | 16.9 | 16.6 | No | 4 |
| Ansari TK et al. #3 (Indian Institute of Science) | 13.2 | 12.0 | N/A | 17.2 | N/A | 15.4 | 13.0 | 12.2 | 12.3 | 15.5 | 13.5 | 13.4 | 17.6 | 16.0 | 16.0 | No | 3 |
| Ansari TK et al. #4 (Indian Institute of Science) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 15.7 | 13.7 | 13.5 | 17.5 | 16.1 | 16.1 | No | 6 |

Table 1. Track 1 across-speaker results (preliminary results in grey)

Columns are as in Table 1.

| System (affiliation) | Eng 1s | Eng 10s | Eng 120s | Fra 1s | Fra 10s | Fra 120s | Man 1s | Man 10s | Man 120s | L1 1s | L1 10s | L1 120s | L2 1s | L2 10s | L2 120s | Superv. | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (MFCC) | 12.0 | 12.1 | 12.1 | 12.5 | 12.6 | 12.6 | 11.5 | 11.5 | 11.5 | 10.3 | 9.3 | 9.4 | 14.1 | 14.3 | 14.1 | No | |
| Topline (Kaldi posteriors, HMM-GMM) | 6.5 | 5.3 | 5.1 | 8.0 | 6.8 | 6.8 | 9.5 | 4.2 | 4.0 | 8.7 | 7.1 | 7.0 | 6.6 | 4.6 | 3.4 | Yes | |
| Chen et al. (Northwestern Polytechnical University) | 8.5 | 7.3 | 7.2 | 11.1 | 9.5 | 9.4 | 10.5 | 8.5 | 8.4 | 7.6 | 6.2 | 6.3 | 11.7 | 9.9 | 9.8 | No | 4 |
| Räsänen et al. (Aalto University) | N/A | 7.7 | N/A | N/A | 9.3 | N/A | N/A | 8.7 | N/A | N/A | 6.6 | N/A | N/A | 11.3 | N/A | No | / |
| Yuan et al. #1 (Northwestern Polytechnical University) | 8.9 | 7.1 | 7.1 | 12.2 | 9.6 | 9.7 | 11.3 | 8.6 | 8.3 | 8.2 | 6.2 | 6.2 | 12.7 | 10.1 | 9.9 | No | 9 |
| Yuan et al. #2 (Northwestern Polytechnical University) | 9.0 | 7.1 | 7.0 | 11.9 | 9.5 | 9.5 | 11.1 | 8.5 | 8.2 | 8.1 | 6.0 | 6.0 | 12.6 | 10.0 | 9.9 | No | 7 |
| Yuan et al. #3 (Northwestern Polytechnical University) | 8.9 | 7.1 | 7.0 | 12.0 | 9.3 | 9.2 | 11.3 | 8.6 | 8.2 | 8.0 | 6.0 | 5.9 | 12.9 | 10.8 | 10.6 | Yes | / |
| Shibata et al. (Tokyo Institute of Technology) | 6.7 | 6.5 | 5.7 | 9.7 | 9.2 | 7.9 | 9.8 | 9.2 | 8.2 | 6.3 | 5.8 | 5.0 | 9.0 | 8.7 | 7.2 | Yes | / |
| Pellegrini et al. #1 (IRIT Toulouse) | 9.9 | 8.2 | 8.3 | 11.8 | 9.7 | 9.6 | 11.0 | 8.5 | 8.2 | 8.9 | 6.7 | 6.4 | 13.3 | 11.9 | 11.8 | No | 11 |
| Pellegrini et al. #2 (IRIT Toulouse) | 9.8 | 8.1 | 8.2 | 11.6 | 9.5 | 9.3 | 10.9 | 8.4 | 8.1 | 8.8 | 6.6 | 6.3 | 13.1 | 11.7 | 11.7 | No | 10 |
| Heck et al. (NAIST) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 6.7 | 5.6 | 5.3 | 12.3 | 8.8 | 8.4 | No | / |
| Heck et al. (NAIST) | 6.9 | 6.2 | 6.0 | 9.7 | 8.7 | 8.4 | 8.8 | 7.9 | 7.8 | 6.5 | 5.6 | 5.3 | 10.9 | 8.8 | 8.4 | No | 1 |
| Chen et al. #2 (Northwestern Polytechnical University) | 8.5 | 7.3 | 7.2 | 11.2 | 9.4 | 9.4 | 10.5 | 8.7 | 8.5 | 7.6 | 6.2 | 6.1 | 11.6 | 9.8 | 9.6 | No | 2 |
| Shibata et al. (Tokyo Institute of Technology) | 5.5 | 5.2 | 4.9 | 7.9 | 7.4 | 6.9 | 7.9 | 7.7 | 7.0 | 5.2 | 4.9 | 4.5 | 6.9 | 7.0 | 6.3 | Yes | / |
| Ansari TK et al. #1 (Indian Institute of Science) | 7.4 | N/A | 6.6 | 9.8 | N/A | 8.5 | 9.3 | N/A | 8.3 | 6.9 | 6.1 | 6.0 | 9.9 | 9.2 | 9.1 | No | 8 |
| Ansari TK et al. #2 (Indian Institute of Science) | 7.4 | N/A | 6.6 | 9.8 | N/A | 8.4 | 9.2 | N/A | 8.2 | 6.8 | 6.0 | 6.0 | 10.1 | 9.6 | 9.6 | No | 4 |
| Ansari TK et al. #3 (Indian Institute of Science) | 7.7 | 6.8 | N/A | 10.4 | N/A | 8.8 | 10.4 | 9.3 | 9.1 | 7.3 | 6.2 | 6.1 | 11.1 | 10.3 | 10.2 | No | 3 |
| Ansari TK et al. #4 (Indian Institute of Science) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 7.6 | 6.4 | 6.2 | 11.6 | 10.9 | 10.7 | No | 6 |

Table 2. Track 1 within-speaker results (preliminary results in grey)

Track 2

Baseline and topline

For the baseline model, we used the JHU system described in Jansen & Van Durme (2011) on PLP features. It performs DTW matching, using random projections to increase efficiency, followed by connected-component graph clustering as a second step. A topline is also reported, obtained with an Adaptor Grammar using a unigram model on the decoding provided by the phone recognizer. The topline performance is probably not attainable by unsupervised systems, since it uses the gold transcription; it is better viewed as a reference for the maximum values that are reasonable to expect on these scores.
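To illustrate the second step, here is a minimal sketch of connected-component clustering over matched fragment pairs using union-find; the fragment IDs and pair list are hypothetical, and the real system operates on DTW matches filtered by match quality:

```python
def connected_components(pairs):
    """Group matched fragment pairs into clusters (connected components)
    using a union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                # merge the two components

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# Three hypothetical matched pairs yield two clusters of sizes 3 and 2:
pairs = [("frag1", "frag2"), ("frag2", "frag3"), ("frag4", "frag5")]
print(sorted(len(c) for c in connected_components(pairs)))  # [2, 3]
```

Each resulting cluster is treated as a candidate word type whose members are the discovered tokens.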

Each row gives the number of discovered words and pairs, the normalized edit distance (NED), coverage (Cov), and precision (P), recall (R), and F-score (F) for the grouping, type, token, and boundary metrics.

| System | Words | Pairs | NED | Cov | Grouping P | R | F | Type P | R | F | Token P | R | F | Boundary P | R | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **English** | | | | | | | | | | | | | | | | |
| JHU-PLP | 12886 | 15730 | 33.9 | 7.9 | 34.7 | 96.6 | 47.9 | 5.0 | 0.7 | 1.2 | 3.9 | 0.3 | 0.5 | 33.9 | 3.1 | 5.7 |
| AG | 8881 | 4222990 | 0.0 | 100.0 | 99.9 | 100.0 | 100.0 | 48.7 | 53.9 | 51.1 | 60.7 | 69.4 | 64.7 | 84.3 | 94 | 88.9 |
| 10.5281/zenodo.810808 | 321603 | 277387 | 34.7 | 29.6 | 52.3 | 41.4 | 45.8 | 3.7 | 3.5 | 3.6 | 1.8 | 2.4 | 2.1 | 22.9 | 24.5 | 23.7 |
| 10.5281/zenodo.815005 | 42473 | 6803405 | 72.6 | 100.0 | 7.1 | 7.4 | 7.2 | 8.3 | 16.7 | 11.1 | 13.0 | 14.1 | 13.5 | 51.0 | 54.4 | 52.7 |
| 10.5281/zenodo.815468 | 92544 | 864639 | 71.4 | 71.0 | 3.3 | 76.2 | 6.2 | 3.9 | 8.6 | 5.4 | 3.2 | 3.5 | 3.3 | 27.1 | 39.4 | 32.1 |
| 10.5281/zenodo.815504 | 92465 | 866170 | 71.4 | 71.0 | 3.2 | 75.8 | 6.2 | 3.9 | 8.6 | 5.4 | 3.2 | 3.5 | 3.3 | 27.0 | 39.4 | 32.1 |
| **French** | | | | | | | | | | | | | | | | |
| JHU-PLP | 1803 | 1636 | 25.4 | 1.6 | 81.1 | 66.4 | 64.2 | 6.9 | 0.2 | 0.3 | 5.2 | 0.1 | 0.1 | 30.9 | 0.6 | 1.1 |
| AG | 7215 | 3211559 | 0.0 | 100.0 | 99.8 | 99.9 | 99.9 | 46.0 | 54.4 | 49.8 | 54.6 | 59.6 | 57.0 | 83.1 | 89.3 | 86.1 |
| 10.5281/zenodo.810808 | 195959 | 168767 | 24.8 | 28.8 | 64.1 | 39.3 | 48.4 | 4.7 | 4.0 | 4.3 | 2.9 | 3.5 | 3.2 | 24.8 | 25.6 | 25.2 |
| 10.5281/zenodo.815005 | 28733 | 4167064 | 67.3 | 97.2 | 10.1 | 6.9 | 8.2 | 3.1 | 6.3 | 4.2 | 3.5 | 3.9 | 3.7 | 37.8 | 41.6 | 39.6 |
| 10.5281/zenodo.815468 | 58701 | 515507 | 62.8 | 67.0 | 6.1 | 82.3 | 11.3 | 3.3 | 6.4 | 4.4 | 2.6 | 2.7 | 2.7 | 26.1 | 37.4 | 30.7 |
| 10.5281/zenodo.815504 | 58716 | 518113 | 62.7 | 67.0 | 6.3 | 81.7 | 11.6 | 3.3 | 6.4 | 4.4 | 2.6 | 2.7 | 2.7 | 26.0 | 37.4 | 30.7 |
| **Mandarin** | | | | | | | | | | | | | | | | |
| JHU-PLP | 156 | 160 | 30.7 | 2.9 | 30.2 | 96.7 | 44.7 | 4.5 | 0.1 | 0.2 | 4.0 | 0.1 | 0.1 | 37.5 | 0.9 | 1.8 |
| AG | 1240 | 791707 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 | 29.3 | 29.3 | 29.3 | 28.1 | 46.1 | 34.9 | 66.2 | 100.0 | 79.7 |
| 10.5281/zenodo.810808 | 26529 | 22644 | 53.4 | 37.1 | 37.4 | 15.0 | 21.4 | 3.4 | 2.6 | 3.0 | 1.6 | 2.7 | 2.0 | 22.2 | 31.3 | 26.0 |
| 10.5281/zenodo.815005 | 2967 | 356585 | 88.1 | 117.7 | 2.9 | 10.6 | 4.6 | 2.5 | 4.1 | 3.1 | 2.5 | 3.4 | 2.9 | 36.2 | 46.7 | 40.8 |
| 10.5281/zenodo.815468 | 2887 | 17845 | 80.2 | 43.4 | 2.4 | 45.0 | 4.6 | 4.5 | 2.9 | 3.5 | 3.6 | 2.0 | 2.6 | 22.8 | 18.6 | 20.5 |
| 10.5281/zenodo.815504 | 2882 | 17824 | 80.0 | 43.3 | 2.5 | 45.3 | 4.7 | 4.4 | 2.9 | 3.5 | 3.6 | 2.0 | 2.5 | 22.7 | 18.5 | 20.4 |
| **L1** | | | | | | | | | | | | | | | | |
| JHU-PLP | 2973 | 3315 | 30.5 | 3.0 | 54.8 | 94.6 | 64.9 | 5.5 | 0.3 | 0.6 | 4.0 | 0.1 | 0.2 | 28.2 | 1.2 | 2.3 |
| AG | 7664 | 2588808 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 | 28.4 | 29.5 | 29.0 | 42.3 | 62.7 | 50.5 | 70.0 | 98.3 | 81.8 |
| 10.5281/zenodo.810808 | 223188 | 191157 | 29.7 | 31.9 | 58.5 | 36.0 | 44.4 | 58.5 | 36.0 | 44.4 | 2.8 | 2.8 | 2.8 | 1.5 | 2.6 | 1.9 |
| 10.5281/zenodo.815005 | 28675 | 4258731 | 66.4 | 100.0 | 11.8 | 5.9 | 7.9 | 5.7 | 11.2 | 7.5 | 10.3 | 14.3 | 12.0 | 42.6 | 56.5 | 48.6 |
| 10.5281/zenodo.815468 | 60648 | 582009 | 59.9 | 71.8 | 5.7 | 63.8 | 10.4 | 3.2 | 6.8 | 4.3 | 2.4 | 3.1 | 2.7 | 20.6 | 37.2 | 26.6 |
| 10.5281/zenodo.815504 | 60489 | 588162 | 59.9 | 71.8 | 5.8 | 64.0 | 10.7 | 3.2 | 6.8 | 4.3 | 2.4 | 3.1 | 2.7 | 20.6 | 37.2 | 26.5 |
| **L2** | | | | | | | | | | | | | | | | |
| JHU-PLP | 462 | 545 | 33.5 | 3.2 | 39.1 | 72.1 | 32.8 | 2.3 | 0.1 | 0.2 | 1.6 | 0.0 | 0.1 | 25.3 | 1.0 | 2.0 |
| AG | 2472 | 602339 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 | 43.0 | 47.6 | 45.2 | 55.6 | 65.6 | 60.2 | 81.3 | 93.2 | 86.9 |
| 10.5281/zenodo.810808 | 18161 | 16620 | 27.0 | 17.4 | 57.8 | 16.4 | 25.5 | 2.9 | 1.9 | 2.3 | 1.4 | 1.1 | 1.2 | 18.9 | 14.2 | 16.2 |
| 10.5281/zenodo.815005 | 3593 | 439321 | 72.2 | 100.0 | 4.7 | 5.4 | 5.0 | 4.6 | 10.0 | 6.3 | 4.9 | 5.2 | 5.0 | 42.4 | 44.3 | 43.3 |
| 10.5281/zenodo.815468 | 5468 | 37191 | 56.8 | 47.8 | 7.6 | 43.3 | 12.8 | 5.9 | 8.2 | 6.9 | 6.1 | 4.2 | 4.9 | 29.6 | 29.6 | 29.6 |
| 10.5281/zenodo.815504 | 5460 | 37273 | 56.8 | 47.8 | 7.7 | 44.0 | 13.0 | 6.0 | 8.3 | 7.0 | 6.1 | 4.2 | 4.9 | 29.7 | 29.6 | 29.6 |

Table 3. Track 2 metrics for the baseline (JHU-PLP) and topline (AG) models and for systems identified by Zenodo DOI, on the English, French, Mandarin, L1, and L2 datasets.
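The NED column in Table 3 is the normalized edit distance: for each discovered pair, the phone-level edit distance between the two fragments' transcriptions, divided by the length of the longer transcription and averaged over all pairs (0 means every pair matches exactly). A minimal sketch, with hypothetical transcriptions:

```python
def levenshtein(a, b):
    """Edit distance between two sequences, single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution / match
    return dp[-1]

def ned(pairs):
    """Mean edit distance normalized by the longer transcription."""
    return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)

# One pair off by a single phone, one exact match: NED = (1/3 + 0) / 2
print(round(ned([("kat", "kap"), ("dog", "dog")]), 3))  # 0.167
```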

Note: the spoken term discovery baseline can be found HERE. The Pitman-Yor Adaptor Grammar sampler can be found HERE. The baseline can be replicated by running the code hosted on GitHub.

System comparisons

To be revealed later...

[1] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015,” in INTERSPEECH 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html