News & Events

EVENT John Hershey Invited to Speak at Deep Learning Summit 2016 in Boston
Date: Thursday, May 12, 2016 - Friday, May 13, 2016
Location: Deep Learning Summit, Boston, MA
Research Area: Speech & Audio
Brief
- MERL Speech and Audio Senior Team Leader John Hershey is among a set of high-profile researchers invited to speak at the Deep Learning Summit 2016 in Boston on May 12-13, 2016. John will present the team's groundbreaking work on general sound separation using a novel deep learning framework called Deep Clustering. For the first time, an artificial intelligence is able to crack the half-century-old "cocktail party problem", that is, to isolate the speech of a single person from a mixture of multiple unknown speakers, as humans do when having a conversation in a loud crowd.
TALK Advanced Recurrent Neural Networks for Automatic Speech Recognition
Date & Time: Friday, April 29, 2016; 12:00 PM - 1:00 PM
Speaker: Yu Zhang, MIT
Research Area: Speech & Audio
Abstract
- A recurrent neural network (RNN) is a class of neural network models in which connections between neurons form a directed cycle. This creates an internal state of the network, which allows it to exhibit dynamic temporal behavior. Recently, RNN-based acoustic models, in particular an advanced version of the RNN that exploits a structure called long short-term memory (LSTM), have greatly improved automatic speech recognition (ASR) accuracy on many tasks. However, ASR performance with distant microphones, with low resources, in noisy and reverberant conditions, and on multi-talker speech is still far from satisfactory compared to that of humans. To address these issues, we develop new RNN structures inspired by two principles: (1) the structure follows the intuition of human speech recognition; (2) the structure is easy to optimize. The talk will go beyond basic RNNs and introduce prediction-adaptation-correction RNNs (PAC-RNNs) and highway LSTMs (HLSTMs). It covers both uni-directional and bi-directional RNNs, as well as discriminative training applied on top of the RNNs. For efficient training of such RNNs, the talk will describe in some detail two algorithms for learning their parameters: (1) latency-controlled bi-directional model training; and (2) two-pass forward computation for sequence training. Finally, the talk will analyze the advantages and disadvantages of the different variants and propose future directions.
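
The internal state mentioned in the abstract can be illustrated with a minimal sketch of a basic (Elman-style) recurrent unit. This is not the PAC-RNN or HLSTM discussed in the talk. The scalar weights `w_in` and `w_rec` and the function names are hypothetical, chosen only for illustration.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec, b):
    """One recurrent step: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + b)."""
    return math.tanh(w_in * x + w_rec * h_prev + b)

def run_rnn(inputs, w_in=0.5, w_rec=0.9, b=0.0, h0=0.0):
    """Feed a sequence through the unit and return the hidden state at each step."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h is fed back in: this is the cycle in the network.
        h = rnn_step(x, h, w_in, w_rec, b)
        states.append(h)
    return states

# A single input pulse followed by silence (illustrative values).
states = run_rnn([1.0, 0.0, 0.0])
# With the recurrent connection removed, the state forgets the pulse at once.
memoryless = run_rnn([1.0, 0.0, 0.0], w_rec=0.0)
```

With the recurrent weight in place, the state stays non-zero after the input returns to zero, which is the temporal memory the abstract refers to.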
TALK A data-centric approach to driving behavior research: How can signal processing methods contribute to the development of autonomous driving?
Date & Time: Tuesday, March 15, 2016; 12:00 PM - 12:45 PM
Speaker: Prof. Kazuya Takeda, Nagoya University
Research Area: Speech & Audio
Abstract
- Thanks to advanced "internet of things" (IoT) technologies, situation-specific human behavior has become an area of development for practical applications involving signal processing. One important area of development of such practical applications is driving behavior research. Since 1999, I have been collecting driving behavior data in a wide range of signal modalities, including speech/sound, video, physical/physiological sensors, CAN bus, LIDAR and GNSS. The objective of this data collection is to evaluate how well signal models can represent human behavior while driving. In this talk, I would like to summarize our 10 years of study of driving behavior signal processing, which has been based on these signal corpora. In particular, statistical signal models of interactions between traffic contexts and driving behavior, i.e., stochastic driver modeling, will be discussed, in the context of risky lane change detection. I greatly look forward to discussing the scalability of such corpus-based approaches, which could be applied to almost any traffic situation.
TALK Driver's mental workload estimation based on the reflex eye movement
Date & Time: Tuesday, March 15, 2016; 12:45 PM - 1:30 PM
Speaker: Prof. Hirofumi Aoki, Nagoya University
Research Area: Speech & Audio
Abstract
- Driving is a complex skill that involves the vehicle itself (e.g., speed control and instrument operation), other road users (e.g., other vehicles and pedestrians), the surrounding environment, and so on. During driving, visual cues are the main source of information for the brain. To stabilize visual information while you are moving, the eyes move in the opposite direction, based on input to the vestibular system. This involuntary eye movement is called the vestibulo-ocular reflex (VOR), and physiological models of it have been studied. Obinata et al. found that the VOR can be used to estimate mental workload. Since then, our research group has been developing methods to quantitatively estimate mental workload during driving by means of reflex eye movement. In this talk, I will explain the basic mechanism of reflex eye movement and how to apply it to mental workload estimation. I will also introduce our latest work on combining the VOR and OKR (optokinetic reflex) models for naturalistic driving environments.
TALK Emotion Detection for Health Related Issues
Date & Time: Tuesday, February 16, 2016; 12:00 PM - 1:00 PM
Speaker: Dr. Najim Dehak, MIT
Research Area: Speech & Audio
Abstract
- Recently, there has been a great increase of interest in emotion recognition based on different human modalities, such as speech and heart rate. Emotion recognition systems can be very useful in several areas, such as medicine and telecommunications. In the medical field, identifying emotions can be an important tool for detecting and monitoring patients with mental health disorders. In addition, identifying the emotional state from voice provides opportunities for developing automated dialogue systems capable of producing reports for the physician based on frequent phone communication between the system and the patients. In this talk, we will describe a health-related application of a voice-based emotion recognition system for detecting and monitoring the emotional state of people.
EVENT SANE 2015 - Speech and Audio in the Northeast
Date: Thursday, October 22, 2015
Location: Google, New York City, NY
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- SANE 2015, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Thursday October 22, 2015 at Google, in New York City, NY.
  
  It is a follow-up to SANE 2012, held at Mitsubishi Electric Research Labs (MERL), SANE 2013, held at Columbia University, and SANE 2014, held at MIT, which each gathered 70 to 90 researchers and students.
  
  SANE 2015 will feature invited talks by leading researchers from the Northeast, as well as from the international community: Rohit Prasad (Amazon), Michael Mandel (Brooklyn College, CUNY), Ron Weiss (Google), John Hershey (MERL), Pablo Sprechmann (NYU), Tuomas Virtanen (Tampere University of Technology), and Paris Smaragdis (UIUC). It will also feature a lively poster session during lunch time, open to both students and researchers.
  
  SANE 2015 is organized by Jonathan Le Roux (MERL), Hank Liao (Google), Andrew Senior (Google), and John R. Hershey (MERL).
EVENT SANE 2014 - Speech and Audio in the Northeast
Date: Thursday, October 23, 2014
Location: MIT, Cambridge, MA
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- SANE 2014, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Thursday October 23, 2014 at MIT, in Cambridge, MA.
  
  It is a follow-up to SANE 2012, held at Mitsubishi Electric Research Labs (MERL), and SANE 2013, held at Columbia University, which each gathered around 70 researchers and students.
  
  SANE 2014 will feature invited talks by leading researchers from the Northeast as well as Europe: Najim Dehak (MIT), Hakan Erdogan (MERL/Sabanci University), Gael Richard (Telecom ParisTech), George Saon (IBM Research), Andrew Senior (Google Research), Stavros Tsakalidis (BBN - Raytheon), and David Wingate (Lyric). It will also feature a lively poster session during lunch time, open to both students and researchers.
  
  SANE 2014 is organized by Jonathan Le Roux (MERL), Jim Glass (MIT), and John R. Hershey (MERL).
EVENT SANE 2013 - Speech and Audio in the Northeast
Date & Time: Thursday, October 24, 2013; 8:45 AM - 5:00 PM
Location: Columbia University
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- SANE 2013, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Thursday October 24, 2013 at Columbia University, in New York City.
  
  A follow-up to SANE 2012 held in October 2012 at MERL in Cambridge, MA, this year's SANE will be held in conjunction with the WASPAA workshop, held October 20-23 in upstate New York. WASPAA attendees are welcome and encouraged to attend SANE.
  
  SANE 2013 will feature invited speakers from the Northeast, as well as from the international community. It will also feature a lively poster session during lunch time, open to both students and researchers.
  
  SANE 2013 is organized by Prof. Dan Ellis (Columbia University), Jonathan Le Roux (MERL) and John R. Hershey (MERL).
TALK Efficiently sampling wave fields
Date & Time: Thursday, October 17, 2013; 12:00 PM
Speaker: Prof. Laurent Daudet, Paris Diderot University, France
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
Abstract
- In acoustics, one may wish to acquire a wavefield over a whole spatial domain, while we can only make point measurements (i.e., with microphones). Even with few sources, this remains a difficult problem because of reverberation, which can be hard to characterize. This can be seen as a sampling / interpolation problem, and it raises a number of interesting questions: how many sample points are needed, where to place the sampling points, etc. In this presentation, we will review some case studies, in 2D (vibrating plates) and 3D (room acoustics), with numerical and experimental data, where we have developed sparse models, possibly with additional 'structures', based on a physical modeling of the acoustic field. These types of models are well suited to reconstruction techniques known as compressed sensing. These principles can also be used for sub-Nyquist optical imaging: we will show preliminary experimental results of a new compressive imager, remarkably simple in its principle, using a multiply scattering medium.
EVENT CHiME 2013 - The 2nd International Workshop on Machine Listening in Multisource Environments
Date & Time: Saturday, June 1, 2013; 9:00 AM - 6:00 PM
Location: Vancouver, Canada
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- MERL researchers Shinji Watanabe and Jonathan Le Roux are members of the organizing committee of CHiME 2013, the 2nd International Workshop on Machine Listening in Multisource Environments, with Jonathan acting as Program Co-Chair. MERL is also a sponsor of the event.
  
  CHiME 2013 is a one-day workshop to be held in conjunction with ICASSP 2013 that will consider the challenge of developing machine listening applications for operation in multisource environments, i.e., real-world conditions with acoustic clutter, where the number and nature of the sound sources are unknown and change over time. CHiME brings together researchers from a broad range of disciplines (computational hearing, blind source separation, speech recognition, machine learning) to discuss novel and established approaches to this problem. The cross-fertilisation of ideas will foster fresh approaches that efficiently combine the complementary strengths of each research field.
EVENT ICASSP 2013 - Student Career Luncheon
Date & Time: Thursday, May 30, 2013; 12:30 PM - 2:30 PM
Location: Vancouver, Canada
MERL Contacts: Anthony Vetro; Petros T. Boufounos; Jonathan Le Roux
Research Area: Speech & Audio
Brief
- MERL is a sponsor for the first ICASSP Student Career Luncheon that will take place at ICASSP 2013. MERL members will take part in the event to introduce MERL and talk with students interested in positions or internships.
TALK Practical kernel methods for automatic speech recognition
Date & Time: Tuesday, May 7, 2013; 2:30 PM
Speaker: Dr. Yotaro Kubo, NTT Communication Science Laboratories, Kyoto, Japan
Research Area: Speech & Audio
Abstract
- Kernel methods are important because they offer both convexity in estimation and the ability to represent nonlinear classification. However, kernel methods have not conventionally been widely used in automatic speech recognition. In this presentation, I will introduce several attempts to practically incorporate kernel methods into acoustic models for automatic speech recognition. The presentation will consist of two parts. The first part will describe maximum entropy discrimination and its application to kernel machine training. The second part will describe dimensionality reduction of kernel-based features.
TALK Probabilistic Latent Tensor Factorisation
Date & Time: Tuesday, February 26, 2013; 12:00 PM
Speaker: Prof. Taylan Cemgil, Bogazici University, Istanbul, Turkey
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
Abstract
- Algorithms for decompositions of matrices are of central importance in machine learning, signal processing and information retrieval, with SVD and NMF (Nonnegative Matrix Factorisation) being the most widely used examples. Probabilistic interpretations of matrix factorisation models are also well known and are useful in many applications (Salakhutdinov and Mnih 2008; Cemgil 2009; Fevotte et al. 2009). In recent years, decompositions of multiway arrays, known as tensor factorisations, have gained significant popularity for the analysis of large data sets with more than two entities (Kolda and Bader, 2009; Cichocki et al. 2008). We will discuss a subset of these models from a statistical modelling perspective, building upon probabilistic Bayesian generative models and generalised linear models (McCullagh and Nelder). In both views, the factorisation is implicit in a well-defined hierarchical statistical model and factorisations can be computed via maximum likelihood.
  
  We express a tensor factorisation model using a factor graph, and the factor tensors are optimised iteratively. In each iteration, the update equation can be implemented by a message passing algorithm, reminiscent of variable elimination in a discrete graphical model. This setting provides a structured and efficient approach that enables very easy development of application-specific custom models, as well as algorithms for the so-called coupled (collective) factorisations, where an arbitrary set of tensors is factorised simultaneously with shared factors. Extensions to full Bayesian inference for model selection, via variational approximations or MCMC, are also feasible. Well-known models of multiway analysis such as Nonnegative Matrix Factorisation (NMF), Parafac, and Tucker, as well as audio processing models (Convolutive NMF, NMF2D, SF-SSNTF), appear as special cases, and new extensions can easily be developed. We will illustrate the approach with applications in link prediction and audio and music processing.
TALK Bayesian Group Sparse Learning
Date & Time: Monday, January 28, 2013; 11:00 AM
Speaker: Prof. Jen-Tzung Chien, National Chiao Tung University, Taiwan
Research Area: Speech & Audio
Abstract
- Bayesian learning provides attractive tools to model, analyze, search, recognize and understand real-world data. In this talk, I will introduce a new Bayesian group sparse learning approach and its applications to speech recognition and signal separation. First, I present the group sparse hidden Markov models (GS-HMMs), where a sequence of acoustic features is driven by a Markov chain and each feature vector is represented by two groups of basis vectors. The features across states and within states are represented accordingly. The sparse prior is imposed by introducing the Laplacian scale mixture (LSM) distribution. The robustness of speech recognition is illustrated. The LSM distribution is also incorporated into Bayesian group sparse learning based on nonnegative matrix factorization (NMF). This approach is developed to estimate the reconstructed rhythmic and harmonic music signals from a single-channel source signal. A Monte Carlo procedure is presented to infer the two groups of parameters. Future work on Bayesian learning will also be discussed.
TALK Speech recognition for closed-captioning
Date & Time: Tuesday, December 11, 2012; 12:00 PM
Speaker: Takahiro Oku, NHK Science & Technology Research Laboratories
Research Area: Speech & Audio
Abstract
- In this talk, I will present human-friendly broadcasting research conducted at NHK and research on speech recognition for real-time closed-captioning. The goal of human-friendly broadcasting research is to make broadcasting more accessible and enjoyable for everyone, including children, the elderly, and physically challenged persons. The automatic speech recognition technology that NHK has developed makes it possible to create captions for the hearing impaired automatically in real time. For sports programs such as professional sumo wrestling, a closed-captioning system has already been implemented in which captions are created by using speech recognition on a captioning re-speaker. In 2011, NHK General Television started broadcasting closed captions for the information program "Morning Market". After introducing the implemented closed-captioning system, I will talk about a recent improvement obtained with an adaptation method that creates a more effective acoustic model using error correction results. The method reflects recognition error tendencies more effectively.
TALK Understanding Audition via Sound Analysis and Synthesis
Date & Time: Wednesday, October 24, 2012; 11:45 AM
Speaker: Josh McDermott, MIT, BCS
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
EVENT SANE 2012 - Speech and Audio in the Northeast
Date & Time: Wednesday, October 24, 2012; 8:30 AM - 5:00 PM
Location: MERL
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- SANE 2012, a one-day event gathering researchers and students in speech and audio from the northeast of the American continent, will be held on Wednesday October 24, 2012 at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, MA.
TALK Latent Topic Modeling of Conversational Speech
Date & Time: Wednesday, October 24, 2012; 1:30 PM
Speaker: Dr. Timothy J. Hazen and David Harwath, MIT Lincoln Labs / MIT CSAIL
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Zero-Resource Speech Pattern and Sub-Word Unit Discovery
Date & Time: Wednesday, October 24, 2012; 9:10 AM
Speaker: Prof. Jim Glass and Chia-ying Lee, MIT CSAIL
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Recognizing and Classifying Environmental Sounds
Date & Time: Wednesday, October 24, 2012; 11:00 AM
Speaker: Prof. Dan Ellis, Columbia University
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK A new class of dynamical system models for speech and audio
Date & Time: Wednesday, October 24, 2012; 4:05 PM
Speaker: Dr. John R. Hershey, MERL
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Factorial Hidden Restricted Boltzmann Machines for Noise Robust Speech Recognition
Date & Time: Wednesday, October 24, 2012; 3:20 PM
Speaker: Dr. Steven J. Rennie, IBM Research
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Advances in Acoustic Modeling at IBM Research: Deep Belief Networks, Sparse Representations
Date & Time: Wednesday, October 24, 2012; 9:55 AM
Speaker: Dr. Tara Sainath, IBM Research
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Self-Organizing Units (SOUs): Training Speech Recognizers Without Any Transcribed Audio
Date & Time: Wednesday, October 24, 2012; 2:15 PM
Speaker: Dr. Herb Gish, BBN - Raytheon
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Non-negative Hidden Markov Modeling of Audio
Date & Time: Thursday, October 11, 2012; 2:30 PM
Speaker: Dr. Gautham J. Mysore, Adobe
Research Area: Speech & Audio
Abstract
- Non-negative spectrogram factorization techniques have become quite popular in the last decade as they are effective in modeling the spectral structure of audio. They have been extensively used for applications such as source separation and denoising. These techniques, however, fail to account for non-stationarity and temporal dynamics, which are two important properties of audio. In this talk, I will introduce the non-negative hidden Markov model (N-HMM) and the non-negative factorial hidden Markov model (N-FHMM) to model single sound sources and sound mixtures, respectively. They jointly model the spectral structure and temporal dynamics of sound sources, while accounting for non-stationarity. I will also discuss the use of these models in applications such as source separation, denoising, and content-based audio processing, showing why they yield improved performance compared to non-negative spectrogram factorization techniques.
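
For readers unfamiliar with the baseline the abstract starts from, plain non-negative factorization of a spectrogram V into spectral templates W and activations H can be sketched as follows. This does not implement the N-HMM or N-FHMM. It assumes a squared-error cost with the standard multiplicative updates, which the abstract does not specify. The toy matrix and all function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative V (freq x frames) as W (freq x rank) times H (rank x frames)."""
    rng = random.Random(seed)
    n_freq, n_frames = len(v), len(v[0])
    # Strictly positive random initialisation keeps every update well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_freq)]
    h = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_frames)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_freq)]
    return w, h

# Toy "spectrogram": two spectral shapes, each active in different frames.
V = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 2.0],
     [0.0, 0.0, 2.0, 2.0]]

W0, H0 = nmf(V, rank=2, n_iter=0)   # random starting point
W, H = nmf(V, rank=2)               # same start, after the updates
```

Each column of H is estimated independently of its neighbours, so nothing in this model ties one frame to the next. That is the missing temporal structure the abstract points to.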