The 2015 Challenge appeared as a special session at Interspeech 2015 (September 6–10, 2015, Dresden); see the Interspeech 2015 proceedings. The challenge's aims were presented in  and the main results summarized in . The references for Track 1 are [, , , , , & ] and for Track 2 [ & ]. Further papers were published in the SLTU 2016 special topic on zero-resource speech technology and elsewhere [, , & ].
The baseline and topline ABX error rates for Track 1 are given in Table 1 (see also []). For the baseline model, we used 13-dimensional MFCC features computed every 10 ms, and the ABX score was computed using the cosine distance. For the topline model, we used posteriorgrams extracted from a Kaldi GMM-HMM pipeline with MFCC, delta, and delta-delta features, Gaussian mixtures, triphone word-position-dependent states, fMLLR speaker adaptation, and a bigram word language model. The exact same Kaldi pipeline was used for the two languages and gave a phone error rate (PER) of 26.4% for English and 7.5% for Tsonga. Note that the two corpora are quite different: the English corpus contains spontaneous, casual speech; the Tsonga corpus contains read speech constructed from a small vocabulary and tailored for building speech recognition applications. The acoustic and language models were trained on the part of the corpora not used in the evaluation, and the posteriorgrams were fed into the ABX evaluation software using the KL divergence. Unsupervised models are expected to fall in between the performance of these two systems.
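The core of the ABX metric is a simple discriminability test: given two tokens A and B from different categories and a third token X from A's category, the system is correct if X is closer to A than to B under the chosen distance. The following is a minimal sketch of that decision, not the official evaluation software; it assumes DTW alignment of frame sequences with a frame-level cosine distance, and all function names are illustrative.

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance between two feature frames (0 = identical direction).
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def dtw_distance(seq1, seq2, frame_dist=cosine_distance):
    # Classic dynamic time warping over two (n_frames, n_dims) arrays,
    # normalized by the summed sequence lengths.
    n, m = len(seq1), len(seq2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq1[i - 1], seq2[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def abx_error(a, b, x):
    # X is drawn from A's category: count an error when X lands closer to B.
    return 1.0 if dtw_distance(b, x) < dtw_distance(a, x) else 0.0
```

Averaging `abx_error` over many (A, B, X) triplets yields the ABX error rate; for the topline, the cosine frame distance would be replaced by the KL divergence between posteriorgrams.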
For the baseline model, we used the JHU system described in Jansen & Van Durme (2011), run on PLP features. It performs DTW matching, uses random projections to increase efficiency, and applies connected-component clustering as a second step. The topline is an Adaptor Grammar with a unigram grammar, run on the gold phoneme transcription. Here, the topline performance is probably not attainable by unsupervised systems, since it uses the gold transcription; it is better viewed as a reference for the maximum value that it is reasonable to expect on the metrics used.
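The random-projection step in the baseline can be illustrated as follows. This is a rough sketch of the general locality-sensitive hashing idea, not the JHU system's exact implementation: signed random projections turn each frame into a bit signature, and the fraction of differing bits estimates the angle between the original frames, so cheap Hamming comparisons approximate the cosine similarity used in matching.

```python
import numpy as np

def lsh_signatures(frames, n_bits=64, seed=0):
    # Project (n_frames, n_dims) features onto n_bits random directions and
    # keep only the sign: frames with similar direction get similar bits.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((frames.shape[1], n_bits))
    return frames @ proj > 0

def approx_cosine_sim(sig1, sig2):
    # The expected fraction of differing bits equals theta / pi, where theta
    # is the angle between the original frames, so cos(pi * hamming)
    # estimates their cosine similarity.
    hamming = np.mean(sig1 != sig2)
    return np.cos(np.pi * hamming)
```

Comparing 64-bit signatures instead of dense PLP frames makes the all-pairs DTW search over a whole corpus tractable, at the cost of an approximate distance.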
 D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A Comparison of Neural Network Methods for Unsupervised Representation Learning on the Zero Resource Speech Challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3199.html
 H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic Modeling: A Feasibility Study,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. Available: http://www.isca-speech.org/archive/interspeech_2015/i15_3189.html
 P. Baljekar, S. Sitaram, P. K. Muthukumar, and A. W. Black, “Using Articulatory Features and Inferred Phonological Segments in Zero Resource Speech Processing,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. Available: http://www.cs.cmu.edu/~pbaljeka/papers/IS2015.pdf
 V. Lyzinski, G. Sell, and A. Jansen, “An Evaluation of Graph Clustering Methods for Unsupervised Term Discovery,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015. Available: https://ccrma.stanford.edu/~gsell/pubs/2015_IS1.pdf
 N. Zeghidour, G. Synnaeve, M. Versteegh, and E. Dupoux, “A Deep Scattering Spectrum - Deep Siamese network Pipeline For Unsupervised Acoustic Modeling,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4965–4969.
 M. Heck, S. Sakti, and S. Nakamura, “Unsupervised Linear Discriminant Analysis for Supporting DPGMM Clustering in the Zero Resource Scenario,” Procedia Computer Science, vol. 81, pp. 73–79, 2016.
 B. M. L. Srivastava and M. Shrivastava, “Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings,” in International Conference on Statistical Language and Speech Processing, 2016, pp. 80–95.