TR2017-181

Multi-level Language Modeling and Decoding for Open Vocabulary End-to-End Speech Recognition


    •  Hori, T., Watanabe, S., Hershey, J.R., "Multi-level Language Modeling and Decoding for Open Vocabulary End-to-End Speech Recognition", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), DOI: 10.1109/ASRU.2017.8268948, December 2017.
      @inproceedings{Hori2017dec,
        author = {Hori, Takaaki and Watanabe, Shinji and Hershey, John R.},
        title = {Multi-level Language Modeling and Decoding for Open Vocabulary End-to-End Speech Recognition},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2017,
        month = dec,
        doi = {10.1109/ASRU.2017.8268948},
        url = {https://www.merl.com/publications/TR2017-181}
      }
  • Research Areas:

    Artificial Intelligence, Speech & Audio

Abstract:

We propose a combination of character-based and word-based language models in an end-to-end automatic speech recognition (ASR) architecture. In our prior work, we combined a character-based LSTM RNN-LM with a hybrid attention/connectionist temporal classification (CTC) architecture. The character LMs improved recognition accuracy to rival state-of-the-art DNN/HMM systems in Japanese and Mandarin Chinese tasks. Although a character-based architecture allows open vocabulary recognition, character-based LMs generally underperform word LMs for languages such as English that have a small alphabet, because it is difficult to model linguistic constraints across long sequences of characters. This paper presents a novel method for end-to-end ASR decoding with LMs at both the character and word level. Hypotheses are first scored with the character-based LM until a word boundary is encountered. Known words are then re-scored using the word-based LM, while the character-based LM provides the scores for out-of-vocabulary words. In a standard Wall Street Journal (WSJ) task, we achieved 5.6% WER on the Eval'92 test set using only the SI284 training set and WSJ text data, which is the best score reported for end-to-end ASR systems on this benchmark.
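The decoding rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `multilevel_lm_score`, the toy models `toy_char_lm` and `toy_word_lm`, and the choice to replace a known word's accumulated character-level score (including its boundary character) with the word-level score are all assumptions made for illustration. The sketch also scores a complete string rather than running inside a beam search with attention and CTC scores.

```python
import math


def multilevel_lm_score(text, char_lm, word_lm, vocab, boundary=" "):
    """Log LM score of `text` using a character LM and a word LM together.

    Illustrative assumption: `char_lm(prefix, ch)` and `word_lm(history, word)`
    are hypothetical callables that return log-probabilities.
    """
    total = 0.0      # running log score of the whole hypothesis
    pending = 0.0    # character-LM log score of the word being spelled
    word = ""        # characters of the current (unfinished) word
    history = []     # completed words, used as word-LM context
    prefix = ""      # all characters seen so far, used as char-LM context
    for ch in text:
        # Step 1: always score the next character with the character LM.
        lp = char_lm(prefix, ch)
        total += lp
        pending += lp
        prefix += ch
        if ch != boundary:
            word += ch
            continue
        # Step 2: at a word boundary, re-score known words with the word LM
        # by swapping out the character-level score accumulated for them.
        if word in vocab:
            total += word_lm(tuple(history), word) - pending
        # Out-of-vocabulary words simply keep their character-LM score.
        if word:
            history.append(word)
        word, pending = "", 0.0
    return total


# Toy models (assumptions): a uniform character LM over 27 symbols and a
# context-independent word LM with made-up probabilities.
def toy_char_lm(prefix, ch):
    return math.log(1.0 / 27.0)


TOY_WORD_PROBS = {"the": 0.5, "cat": 0.1}


def toy_word_lm(history, word):
    return math.log(TOY_WORD_PROBS[word])


score = multilevel_lm_score(
    "the zyx ", toy_char_lm, toy_word_lm, vocab=set(TOY_WORD_PROBS)
)
```

In this toy run, "the" is in the vocabulary and ends up with the word-LM score, while "zyx" is out of vocabulary and keeps the score of its four characters (three letters plus the boundary) from the character LM.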

 

  • Related News & Events

    •  NEWS: MERL presents 3 papers at ASRU 2017; John Hershey serves as general chair
      Date: December 16, 2017 - December 20, 2017
      Where: Okinawa, Japan
      MERL Contacts: Chiori Hori; Jonathan Le Roux
      Research Area: Speech & Audio
      Brief
      • MERL presented three papers at the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), which was held in Okinawa, Japan from December 16-20, 2017. ASRU is the premier speech workshop, bringing together researchers from academia and industry in an intimate and collegial setting. More than 270 people attended the event this year, a record number. MERL's Speech and Audio Team was a key part of the organization of the workshop, with John Hershey serving as General Chair, Chiori Hori as Sponsorship Chair, and Jonathan Le Roux as Demonstration Chair. Two of the papers by MERL were selected among the 10 finalists for the best paper award. Mitsubishi Electric and MERL were also Platinum sponsors of the conference, with MERL awarding the MERL Best Student Paper Award.