TR2017-150

Duration-Controlled LSTM for Polyphonic Sound Event Detection

- Hayashi, T., Watanabe, S., Toda, T., Hori, T., Le Roux, J., Takeda, K., "Duration-Controlled LSTM for Polyphonic Sound Event Detection", IEEE/ACM Transactions on Audio, Speech, and Language Processing, DOI: 10.1109/TASLP.2017.2740002, Vol. 25, No. 11, August 2017.
  BibTeX:
  @article{Hayashi2017aug,
    author = {Hayashi, Tomoki and Watanabe, Shinji and Toda, Tomoki and Hori, Takaaki and Le Roux, Jonathan and Takeda, Kazuya},
    title = {Duration-Controlled LSTM for Polyphonic Sound Event Detection},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    year = 2017,
    volume = 25,
    number = 11,
    month = aug,
    doi = {10.1109/TASLP.2017.2740002},
    issn = {2329-9304},
    url = {https://www.merl.com/publications/TR2017-150}
  }
MERL Contact:
- Jonathan Le Roux
Research Areas:

Artificial Intelligence, Speech & Audio

Abstract:

This paper presents a new hybrid approach called duration-controlled long short-term memory (LSTM) for polyphonic Sound Event Detection (SED). It builds upon a state-of-the-art SED method which performs frame-by-frame detection using a bidirectional LSTM recurrent neural network (BLSTM), and incorporates a duration-controlled modeling technique based on a hidden semi-Markov model (HSMM). The proposed approach makes it possible to model the duration of each sound event precisely and to perform sequence-by-sequence detection without having to resort to thresholding, as in conventional frame-by-frame methods. Furthermore, to effectively reduce sound event insertion errors, which often occur under noisy conditions, we also introduce a binary-mask-based post-processing which relies on a sound activity detection (SAD) network to identify segments with any sound event activity, an approach inspired by the well-known benefits of voice activity detection in speech recognition systems. We conduct an experiment using the DCASE2016 task 2 dataset to compare our proposed method with typical conventional methods, such as non-negative matrix factorization (NMF) and standard BLSTM. Our proposed method outperforms the conventional methods both in an event-based evaluation, achieving a 75.3% F1 score and a 44.2% error rate, and in a segment-based evaluation, achieving an 81.1% F1 score and a 32.9% error rate. These results also exceed the best results reported in the DCASE2016 task 2 Challenge.
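The binary-mask-based post-processing mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.5 threshold used to binarize the SAD output, and the toy values are all assumptions made for illustration, and the SAD network itself is replaced by a hand-written list of posteriors. The idea shown is only that event decisions are zeroed out in frames where the SAD mask reports no sound activity, which removes insertions in those frames.

```python
def binarize(posteriors, threshold=0.5):
    """Turn frame-wise posteriors into 0/1 decisions.

    The threshold value is an assumption for this sketch.
    """
    return [1 if p >= threshold else 0 for p in posteriors]


def apply_sad_mask(event_activity, sad_mask):
    """Zero out every event class in frames where the SAD mask is 0.

    event_activity: list of frames, each a list of 0/1 flags per event class.
    sad_mask: list of 0/1 flags, one per frame (1 = some sound event active).
    """
    if len(event_activity) != len(sad_mask):
        raise ValueError("event_activity and sad_mask need the same number of frames")
    return [
        [flag * keep for flag in frame]
        for frame, keep in zip(event_activity, sad_mask)
    ]


# Toy example: 5 frames, 2 event classes (all values are made up).
event_activity = [[0, 1], [1, 1], [1, 0], [0, 1], [0, 0]]
sad_posteriors = [0.1, 0.9, 0.8, 0.2, 0.05]  # stand-in for a SAD network output

sad_mask = binarize(sad_posteriors)
masked = apply_sad_mask(event_activity, sad_mask)
```

In this toy input, the detections in the first and fourth frames fall where the SAD posterior is low, so the mask suppresses them while leaving the second and third frames unchanged.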
