The Zero Resource Speech Challenge

ZeroSpeech 2015

Results

The 2015 Challenge appeared as a special session at Interspeech 2015 (September 6-10, 2015, Dresden); see the Interspeech 2015 proceedings. The challenge’s aims were presented in [1] and the main results are summarized in [2]. The references for Track 1 are [3]-[8] and those for Track 2 are [9] and [10]. Further papers were published in the SLTU 2016 special topic on zero resource speech technology and elsewhere [11]-[13].

Track 1

Baseline and topline

The baseline and topline ABX error rates for Track 1 are given in Table 1 (see also [1]). For the baseline model, we used 13-dimensional MFCC features computed every 10 ms, and the ABX score was computed using the cosine distance between frames. For the topline model, we used posteriorgrams extracted from a Kaldi GMM-HMM pipeline built on MFCC features with delta and delta-delta coefficients, Gaussian mixture observation densities, word-position-dependent triphone states, fMLLR talker adaptation, and a bigram word language model. The exact same Kaldi pipeline was used for the two languages and gave a phone error rate (PER) of 26.4% for English and 7.5% for Tsonga. Note that the two corpora are quite different: the English corpus contains spontaneous, casual speech; the Tsonga corpus contains read speech constructed from a small vocabulary and tailored for building speech recognition applications. The acoustic and language models were trained on the part of each corpus not used in the evaluation, and the resulting posteriorgrams were fed into the ABX evaluation software, which scored them using the KL divergence. Unsupervised models are expected to fall between the performance of these two systems.

Table 1. Track 1 ABX error rate (%) for baseline and topline models on the English and Tsonga datasets.

Note: the Kaldi recipes can be found HERE.
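To make the evaluation concrete, here is a minimal Python sketch of a single ABX trial over DTW-aligned features, assuming NumPy. The function names are ours; this illustrates the metric, it is not the challenge’s evaluation software.

```python
# A minimal sketch of one ABX trial (illustration only, NOT the official
# ABX evaluation software). For posteriorgram inputs, the frame-level
# cosine distance below would be replaced by a (symmetrised) KL divergence.
import numpy as np

def cosine_frame_distance(u, v):
    """Cosine distance between two feature frames (e.g., 13-dim MFCC vectors)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def dtw_distance(x, y, frame_distance=cosine_frame_distance):
    """Average frame distance along the best DTW alignment of two token
    representations x and y, each of shape (n_frames, n_dims)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(x[i - 1], y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length-normalize so unequal tokens compare

def abx_error(a, b, x):
    """One ABX trial: A and X belong to the same category, B to another.
    Returns 1.0 on an error (X closer to B), 0.0 when correct, 0.5 on a tie."""
    d_ax, d_bx = dtw_distance(a, x), dtw_distance(b, x)
    return 0.5 if d_ax == d_bx else float(d_ax > d_bx)
```

The reported ABX error rate is the average of such trial scores over all (A, B, X) triplets, aggregated over categories.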

System comparisons

The up-to-date results can be found HERE. If your result does not appear there, please email us.

Track 2

Baseline and topline

For the baseline model, we used the JHU system described in Jansen & Van Durme (2011), run on PLP features. It performs DTW matching, using random projections for efficiency, followed by connected-component clustering as a second step. The topline is an Adaptor Grammar with a unigram grammar, run on the gold phoneme transcription. Here, the topline performance is probably not attainable by unsupervised systems, since it uses the gold transcription; it is better seen as a reference for the maximum value that it is reasonable to expect on the metrics used.

Table 2. Track 2 metrics for baseline and topline models on the English and Tsonga datasets.

Note: the spoken term discovery baseline can be found HERE. The Pitman-Yor Adaptor Grammar sampler can be found HERE.
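As an illustration of the two ideas behind the baseline (random projections for efficient matching, then connected-component clustering), here is a minimal Python sketch assuming NumPy and SciPy. The function names and parameters are placeholders of ours, not the actual JHU implementation.

```python
# Hedged sketch of the baseline's two steps (illustration only, NOT the JHU code).
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def random_projection_signatures(frames, n_bits=64, seed=0):
    """Sign of random projections turns each frame vector into a bit signature
    whose Hamming distance approximates the cosine distance, which is what
    makes large-scale DTW matching tractable."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((frames.shape[1], n_bits))
    return frames @ planes > 0  # (n_frames, n_bits) boolean signatures

def cluster_fragments(n_fragments, matched_pairs):
    """Second step: treat DTW-matched fragment pairs as edges of a graph and
    return its connected components as the discovered term clusters."""
    rows, cols = zip(*matched_pairs) if matched_pairs else ((), ())
    graph = coo_matrix((np.ones(len(rows)), (rows, cols)),
                       shape=(n_fragments, n_fragments))
    _, labels = connected_components(graph, directed=False)
    return labels  # labels[i] is the cluster id of fragment i

# Toy example: fragments 0, 1, 2 are chained by matches; fragment 3 is alone.
print(cluster_fragments(4, [(0, 1), (1, 2)]))  # -> [0 0 0 1]
```

Connected components make clustering transitive: two fragments end up in the same cluster whenever a chain of pairwise DTW matches links them, even if they were never matched directly.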

System comparisons

The results can be found HERE.

Challenge References

[1] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html

[2] M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015: Proposed Approaches and Results,” in SLTU-2016, 2016. Available: http://www.lscp.net/persons/dupoux/papers/Versteegh_AJD_2016.ZeroSpeech%202015%20results.SLTU.pdf

[3] R. Thiollière, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, “A Hybrid Dynamic Time Warping-Deep Neural Network Architecture for Unsupervised Acoustic Modeling,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3169.html

[4] L. Badino, A. Mereta, and L. Rosasco, “Discovering Discrete Subword Units with Binarized Autoencoders and Hidden-Markov-Model Encoders,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3174.html

[5] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A Comparison of Neural Network Methods for Unsupervised Representation Learning on the Zero Resource Speech Challenge,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3199.html

[6] W. Agenbag and T. Niesler, “Automatic Segmentation and Clustering of Speech Using Sparse Coding and Metaheuristic Search,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3184.html

[7] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic Modeling: A Feasibility Study,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3189.html

[8] P. Baljekar, S. Sitaram, P. K. Muthukumar, and A. W. Black, “Using Articulatory Features and Inferred Phonological Segments in Zero Resource Speech Processing,” in INTERSPEECH-2015, 2015. Available: http://www.cs.cmu.edu/~pbaljeka/papers/IS2015.pdf

[9] O. Räsänen, G. Doyle, and M. C. Frank, “Unsupervised word discovery from speech using automatic segmentation into syllable-like units,” in INTERSPEECH-2015, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3204.html

[10] V. Lyzinski, G. Sell, and A. Jansen, “An Evaluation of Graph Clustering Methods for Unsupervised Term Discovery,” in INTERSPEECH-2015, 2015. Available: https://ccrma.stanford.edu/~gsell/pubs/2015_IS1.pdf

[11] N. Zeghidour, G. Synnaeve, M. Versteegh, and E. Dupoux, “A Deep Scattering Spectrum - Deep Siamese network Pipeline For Unsupervised Acoustic Modeling,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4965–4969.

[12] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised Linear Discriminant Analysis for Supporting DPGMM Clustering in the Zero Resource Scenario,” Procedia Computer Science, vol. 81, pp. 73–79, 2016.

[13] B. M. L. Srivastava and M. Shrivastava, “Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings,” in International Conference on Statistical Language and Speech Processing, 2016, pp. 80–95.