TR2021-037

Semi-Supervised Speech Recognition via Graph-Based Temporal Classification

- Moritz, N., Hori, T., Le Roux, J., "Semi-Supervised Speech Recognition via Graph-Based Temporal Classification", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP39728.2021.9414058, June 2021, pp. 6548-6552.
BibTeX:

@inproceedings{Moritz2021jun2,
author = {Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
title = {{Semi-Supervised Speech Recognition via Graph-Based Temporal Classification}},
booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = 2021,
pages = {6548--6552},
month = jun,
doi = {10.1109/ICASSP39728.2021.9414058},
url = {https://www.merl.com/publications/TR2021-037}
}

MERL Contact:

Jonathan Le Roux

Research Areas:

Artificial Intelligence, Machine Learning, Speech & Audio

Abstract:

Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training labels. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only a temporal alignment, similarly to CTC, but also a label alignment to obtain the optimal pseudo-label sequence from the weighted graph. Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, considerably outperforming standard pseudo-labeling, with ASR results approaching an oracle experiment in which the best hypotheses of the N-best lists are selected manually.
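
The idea of training against several pseudo-label hypotheses at once can be illustrated with a minimal sketch. It is not the authors' implementation and it does not build a WFST. It assumes the simplest possible graph: each N-best hypothesis is its own parallel path, entered with a positive weight, so the loss reduces to the negative log of a weighted sum of ordinary CTC likelihoods. The function names `ctc_log_likelihood` and `nbest_graph_loss`, the weight values, and the toy posteriors are all hypothetical and chosen only for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC forward pass in the log domain.

    log_probs[t][k] is the log posterior of symbol k at frame t.
    Returns log p(labels | input), summed over all valid alignments.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_states = len(ext)

    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]                      # stay in the same state
            if s >= 1:
                candidates.append(alpha[s - 1])          # advance by one state
            # Skip over a blank only between two different non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = logsumexp(*candidates) + log_probs[t][ext[s]]
        alpha = new_alpha

    if num_states > 1:
        return logsumexp(alpha[-1], alpha[-2])
    return alpha[-1]


def nbest_graph_loss(log_probs, nbest, blank=0):
    """Assumed simplification: the graph is a union of parallel hypothesis paths.

    nbest is a list of (labels, weight) pairs with weight > 0.
    Returns -log sum_n weight_n * p_CTC(labels_n | input).
    """
    terms = [
        math.log(weight) + ctc_log_likelihood(log_probs, labels, blank)
        for labels, weight in nbest
    ]
    return -logsumexp(*terms)


# Toy usage: 2 frames, 3 symbols (index 0 is blank), uniform posteriors.
uniform = [[math.log(1.0 / 3.0)] * 3 for _ in range(2)]
hypotheses = [([1], 0.75), ([1, 2], 0.25)]  # hypothetical N-best list with scores
loss = nbest_graph_loss(uniform, hypotheses)
```

In this simplified form the gradient flows to every hypothesis in proportion to its weighted likelihood, which gives a rough sense of how the model can come to favor one pseudo-label sequence over the others. A graph built from a real N-best list would typically share states between hypotheses, which this sketch does not attempt.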

Related Publication

Moritz, N., Hori, T., Le Roux, J., "Semi-Supervised Speech Recognition Via Graph-Based Temporal Classification", arXiv, October 2020.

BibTeX:

@article{Moritz2020oct2,
author = {Moritz, Niko and Hori, Takaaki and {Le Roux}, Jonathan},
title = {{Semi-Supervised Speech Recognition Via Graph-Based Temporal Classification}},
journal = {arXiv},
year = 2020,
month = oct,
url = {https://arxiv.org/abs/2010.15653}
}
