The Zero Resource Speech Challenge

ZeroSpeech 2015

Task and intended goal

This challenge targets the unsupervised discovery of linguistic units from raw speech in an unknown language, focusing on two levels of linguistic structure: subword units and word units.

Psycholinguistic evidence shows that infants complete the learning of subword units and start to construct a recognition lexicon within the first year of life, without any access to orthographic or phonetic labels, although they may have multimodal input and proprioceptive feedback (babbling), which is not modeled in this challenge. Here, we set up the rather extreme situation in which linguistic units have to be learned from audio only. These two levels have already been investigated in previous work (see [1-7] and [8-11], respectively), but the performance of the different systems has not yet been compared using common evaluation metrics and datasets. In the first track, we use a psychophysically inspired evaluation task (minimal-pair ABX discrimination); in the second, metrics inspired by those used in NLP word segmentation applications (segmentation and token F-scores).

  • Track 1: unsupervised subword modeling. The aim of this task is to construct a representation of speech sounds which is robust to within- and between-talker variation and supports word identification. The metric we will use is the ABX discriminability between phonemic minimal pairs (see [12,13]). The ABX discriminability between the minimal pair "beg" and "bag" is defined as the probability that A and X are further apart than B and X, where A and X are tokens of "beg" and B a token of "bag" (or vice versa), distance being defined as the DTW divergence of the representations of the tokens. Our global ABX discriminability score aggregates over the entire set of minimal pairs like "beg"-"bag" to be found in the corpus. We analyze the effects of within- and between-talker variation separately.
  • Track 2: spoken term discovery. The aim of this task is the unsupervised discovery of "words", defined as recurring speech fragments. The systems should take raw speech as input and output a list of speech fragments (timestamps referring to the original audio file) together with a discrete label for category membership. The evaluation will use the suite of F-score metrics described in [14], which enables a detailed assessment of the different components of a spoken term discovery pipeline (matching, clustering, segmentation, parsing) and thus supports a direct comparison with NLP models of unsupervised word segmentation.
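As a toy illustration of the Track 1 score, the sketch below computes the measure defined above: the fraction of (A, B, X) triples in which the same-category pair A, X ends up farther apart (by DTW) than the different-category pair B, X. The function names and the one-dimensional "tokens" are purely illustrative and are not part of the official evaluation software; real systems would compare sequences of frame-wise feature vectors.

```python
from itertools import product

def dtw_distance(a, b):
    """Dynamic time warping divergence between two token representations.
    For simplicity, tokens here are lists of floats and the frame-level
    distance is the absolute difference; real evaluations use frame vectors."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def abx_error(cat_a, cat_b):
    """Fraction of (A, B, X) triples, with A and X distinct tokens of cat_a
    and B a token of cat_b, for which X ends up farther from A than from B
    (ties count half). Lower is better."""
    errors, total = 0.0, 0
    for a, x in product(cat_a, repeat=2):
        if a is x:
            continue  # A and X must be distinct tokens of the same category
        for b in cat_b:
            d_ax = dtw_distance(a, x)
            d_bx = dtw_distance(b, x)
            errors += 1.0 if d_ax > d_bx else (0.5 if d_ax == d_bx else 0.0)
            total += 1
    return errors / total

# Two toy categories ("beg"-like and "bag"-like contours):
beg = [[1.0, 2.0, 1.0], [1.0, 2.0, 2.0]]
bag = [[5.0, 6.0, 5.0], [5.0, 6.0, 6.0]]
print(abx_error(beg, bag))  # → 0.0 (the categories are perfectly separable)
```

The actual evaluation additionally aggregates this score over all minimal pairs in the corpus and computes the within- and between-talker conditions separately.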

You can find more details on these two tracks in the relevant tabs (Track 1 and Track 2).
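For Track 2, the evaluation toolbox scores matching, clustering, segmentation, and parsing quality separately; as a minimal, simplified sketch of the token F-score idea only, the hypothetical function below scores discovered fragments against gold word tokens using exact boundary matching (the official metrics are more elaborate than this):

```python
def token_fscore(discovered, gold):
    """Token precision, recall, and F-score. A discovered fragment counts
    as a hit if its (start, end) boundaries exactly match a gold word token.
    Fragments are (start, end) pairs, e.g., timestamps in seconds.
    NOTE: exact boundary matching is a simplifying assumption for this sketch."""
    gold_set = set(gold)
    hits = sum(1 for frag in discovered if frag in gold_set)
    precision = hits / len(discovered) if discovered else 0.0
    recall = hits / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

# A system discovered 3 fragments; the gold transcript has 4 word tokens:
p, r, f = token_fscore(
    [(0.0, 0.5), (0.5, 1.0), (1.2, 1.5)],
    [(0.0, 0.5), (0.5, 1.0), (1.0, 1.2), (1.2, 1.5)],
)
print(p, r, f)  # → 1.0 0.75 0.857...
```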

This challenge first appeared as a special session at Interspeech 2015 (September 6-10, 2015, Dresden). It uses only open-access materials. It remains permanently open: participants can register, download the materials, and try to beat the best systems at any time. See below for registration.

Data and Sharing

To encourage teams from both the ASR and non-ASR communities to participate in these tracks, all of the resources for this challenge (evaluation and baseline software, datasets) are free and open source. We strongly encourage participants to make their systems available as open source. This is not only good scientific practice (it enables verification and replication); we also believe it will encourage the growth of this domain by facilitating the emergence of new teams and participants.

Data for the challenge is drawn from two languages: an English dataset that is nevertheless treated as a zero-resource language (meaning no pretraining with any English dataset is allowed), and a low-resource language, Xitsonga. The data is made available in three sets.

  • the sample set (2 speakers, 40 min each, English) is provided for anyone to download (see Getting Started) together with the evaluation software.
  • the English test dataset (casual conversations, 12 speakers, 16-30 min each, total 5h)
  • the Xitsonga test dataset (read speech, 24 speakers, 2-29 min each, total 2h30min).

To get these datasets, see Registration below. All datasets have been prepared in the following way:

  • the original recordings were segmented into short files that contain only "clean speech", i.e., no overlaps, pauses, or nonspeech noises, and contain the speech of only a single speaker.
  • the file names contain a talker ID. We kept this information because infants arguably have access to it when they learn their language, and because it is relatively easy to recover anyway. The proposed systems can therefore openly use it.

Ground rule

This challenge is primarily driven by a scientific question: how could an infant or a system learn language(s) in an unsupervised fashion? We therefore expect submissions to emphasize novel and interesting ideas (as opposed to trying to get the best result by all possible means). Since we provide the evaluation software, there is a distinct possibility that it will be used to optimize system parameters for the particular corpus at hand. Doing so would blur the comparison between competing ideas and architectures, especially if this information is not disclosed. We therefore kindly ask participants to disclose, whenever they publish their work, whether and how they used the evaluation software to tune particular system parameters.

Similarly, competitors should disclose the type of information they used to train their systems. To compare systems, we will distinguish between those that use no training at all to derive the speech features (pure signal-processing systems), systems that use unsupervised training on the provided datasets (unsupervised systems), and systems that use supervised training on other languages or mixtures of languages (transfer systems). Training features or models on another English dataset is prohibited, except for baseline comparisons.

Registration

As stated above, the challenge remains open, and participants can compete and try to beat the current best system at any time. The only requirement is that the results be sent to the organizers so that we can update the results page.

To register, send an email to zerospeech2015@gmail.com and follow the instructions in this GitHub repository. If you encounter a problem, please email us (zerospeech2015@gmail.com).

You can try out your systems without registering by downloading the starter kit (see Getting Started).

Organizers

Challenge Organization

  • Xavier Anguera (Telefonica)
  • Emmanuel Dupoux (Ecole des Hautes Etudes en Sciences Sociales, Paris)
  • Aren Jansen (Johns Hopkins University, Baltimore)
  • Maarten Versteegh (ENS, Paris)

Track 1

  • Thomas Schatz (ENS, Paris)
  • Roland Thiollière (EHESS, Paris)

Track 2

  • Bogdan Ludusan (EHESS, Paris)
  • Maarten Versteegh (ENS, Paris)

Contact

The organizers can be reached at zerospeech2015@gmail.com

Sponsors

ZeroSpeech2015 is funded through an ERC grant to Emmanuel Dupoux (see website).

References

Subword units/embeddings

[1] Badino, L., Canevari, C., Fadiga, L., & Metta, G. (2014). An auto-encoder based approach to unsupervised learning of subword units. In ICASSP.

[2] Huijbregts, M., McLaren, M., & van Leeuwen, D. (2011). Unsupervised acoustic sub-word unit detection for query-by-example spoken term detection. In ICASSP (pp. 4436-4439).

[3] Jansen, A., Thomas, S., & Hermansky, H. (2013). Weak top-down constraints for unsupervised acoustic model training. In ICASSP (pp. 8091-8095).

[4] Lee, C., & Glass, J. (2012). A nonparametric Bayesian approach to acoustic model discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers Volume 1 (pp. 40-49).

[5] Varadarajan, B., Khudanpur, S. & Dupoux, E. (2008). Unsupervised Learning of Acoustic Subword Units. In Proceedings of ACL-08: HLT (pp. 165-168).

[6] Synnaeve, G., Schatz, T. & Dupoux, E. (2014). Phonetics embedding learning with side information. In IEEE SLT.

[7] Siu, M., Gish, H., Chan, A., Belfield, W. & Lowe, S. (2014). Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery. Computer Speech & Language, 28(1), 210-223.

Spoken term discovery

[8] Jansen, A., & Van Durme, B. (2011). Efficient spoken term discovery using randomized algorithms. In IEEE ASRU Workshop (pp. 401-406).

[9] Muscariello, A., Gravier, G., & Bimbot, F. (2012). Unsupervised Motif Acquisition in Speech via Seeded Discovery and Template Matching Combination. IEEE Transactions on Audio, Speech, and Language Processing, 20(7), 2031-2044.

[10] Park, A. S., & Glass, J. R. (2008). Unsupervised Pattern Discovery in Speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 186-197.

[11] Zhang, Y., & Glass, J. R. (2010). Towards multi-speaker unsupervised speech pattern discovery. In ICASSP (pp. 4366-4369).

Evaluation metrics

[12] Schatz, T., Peddinti, V., Cao, X.N., Bach, F., Hermansky, H. & Dupoux, E. (2014). Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise. In Interspeech.

[13] Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H. & Dupoux, E. (2013). Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline. In Interspeech (pp. 1781-1785).

[14] Ludusan, B., Versteegh, M., Jansen, A., Gravier, G., Cao, X.N., Johnson, M. & Dupoux, E. (2014). Bridging the gap between speech technology and natural language processing: an evaluation toolbox for term discovery systems. In Proceedings of LREC.