News & Events

109 were found.




  •  NEWS   MERL's breakthrough speech separation technology featured in Mitsubishi Electric Corporation's Annual R&D Open House
    Date: May 24, 2017
    Where: Tokyo, Japan
    MERL Contacts: Bret Harsham; John Hershey; Jonathan Le Roux
    Research Areas: Multimedia, Speech & Audio
    Brief
    • Mitsubishi Electric Corporation announced that it has created the world's first technology that separates in real time the simultaneous speech of multiple unknown speakers recorded with a single microphone. It's a key step towards building machines that can interact in noisy environments, in the same way that humans can have meaningful conversations in the presence of many other conversations. In tests, the simultaneous speeches of two and three people were separated with up to 90 and 80 percent accuracy, respectively. The novel technology, which was realized with Mitsubishi Electric's proprietary "Deep Clustering" method based on artificial intelligence (AI), is expected to contribute to more intelligible voice communications and more accurate automatic speech recognition. A characteristic feature of this approach is its versatility, in the sense that voices can be separated regardless of their language or the gender of the speakers. A live speech separation demonstration that took place on May 24 in Tokyo, Japan, was widely covered by the Japanese media, with reports by three of the main Japanese TV stations and multiple articles in print and online newspapers. The technology is based on recent research by MERL's Speech and Audio team.
      Links:
      Mitsubishi Electric Corporation Press Release
      MERL Deep Clustering Demo

      Media Coverage:

      Fuji TV, News, "Minna no Mirai" (Japanese)
      The Nikkei (Japanese)
      Nikkei Technology Online (Japanese)
      Sankei Biz (Japanese)
      EE Times Japan (Japanese)
      ITpro (Japanese)
      Nikkan Sports (Japanese)
      Nikkan Kogyo Shimbun (Japanese)
      Dempa Shimbun (Japanese)
      Il Sole 24 Ore (Italian)
      IEEE Spectrum (English)
  •  
  •  EVENT   Society for Industrial and Applied Mathematics panel for students on careers in industry
    Date & Time: Monday, July 10, 2017; 6:15 PM - 7:15 PM
    Speaker: Andrew Knyazev and other panelists, MERL
    MERL Contacts: Joseph Katz; Andrew Knyazev
    Location: David Lawrence Convention Center, Pittsburgh PA
    Research Areas: Electronics & Communications, Multimedia, Data Analytics, Computer Vision, Mechatronics, Algorithms, Advanced Control Systems, Computational Geometry, Computational Photography, Computational Sensing, Decision Optimization, Digital Video, Dynamical Systems, Information Security, Machine Learning, Optical Communications & Devices, Power & RF, Predictive Modeling, Speech & Audio, Wireless Communications & Signal Processing
    Brief
    • Andrew Knyazev accepted an invitation to represent MERL at the panel on Student Careers in Business, Industry and Government at the annual meeting of the Society for Industrial and Applied Mathematics (SIAM).

      The format consists of a five minute introduction by each of the panelists covering their background and an overview of the mathematical and computational challenges at their organization. The introductions will be followed by questions from the students.
  •  
  •  TALK   Generative Model-Based Text-to-Speech Synthesis
    Date & Time: Wednesday, February 1, 2017; 12:00-13:00
    Speaker: Dr. Heiga ZEN, Google
    MERL Host: Chiori Hori
    Research Areas: Multimedia, Speech & Audio
    Brief
    • Recent progress in generative modeling has improved the naturalness of synthesized speech significantly. In this talk I will summarize these generative model-based approaches for speech synthesis such as WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems.
      See https://deepmind.com/blog/wavenet-generative-model-raw-audio/ for further details.
  •  
  •  NEWS   MERL to present 10 papers at ICASSP 2017
    Date: March 5, 2017 - March 9, 2017
    Where: New Orleans
    MERL Contacts: Petros Boufounos; Chen Feng; John Hershey; Takaaki Hori; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Dong Tian; Anthony Vetro; Ye Wang
    Research Areas: Multimedia, Computer Vision, Computational Geometry, Computational Sensing, Digital Video, Information Security, Speech & Audio
    Brief
    • MERL researchers will presented 10 papers at the upcoming IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), to be held in New Orleans from March 5-9, 2017. Topics to be presented include recent advances in speech recognition and audio processing; graph signal processing; computational imaging; and privacy-preserving data analysis.

      ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing. The event attracts more than 2000 participants each year.
  •  
  •  EVENT   MERL organizes Workshop on End-to-End Speech and Audio Processing at NIPS 2016
    Date: Saturday, December 10, 2016
    MERL Contact: John Hershey
    Location: Centre Convencions Internacional Barcelona, Barcelona SPAIN
    Research Areas: Multimedia, Machine Learning, Speech & Audio
    Brief
    • MERL researcher John Hershey, is organizing a Workshop on End-to-End Speech and Audio Processing, on behalf of MERL's Speech and Audio team, and in collaboration with Philemon Brakel of the University of Montreal. The workshop focuses on recent advances to end-to-end deep learning methods to address alignment and structured prediction problems that naturally arise in speech and audio processing. The all day workshop takes place on Saturday, December 10th at NIPS 2016, in Barcelona, Spain.
  •  
  •  EVENT   John Hershey to present tutorial at the 2016 IEEE SLT Workshop
    Date: Tuesday, December 13, 2016
    Speaker: John Hershey, MERL
    MERL Contacts: John Hershey; Jonathan Le Roux
    Location: 2016 IEEE Spoken Language Technology Workshop, San Diego, California
    Research Areas: Multimedia, Machine Learning, Speech & Audio
    Brief
    • MERL researcher John Hershey presents an invited tutorial at the 2016 IEEE Workshop on Spoken Language Technology, in San Diego, California. The topic, "developing novel deep neural network architectures from probabilistic models" stems from MERL work with collaborators Jonathan Le Roux and Shinji Watanabe, on a principled framework that seeks to improve our understanding of deep neural networks, and draws inspiration for new types of deep network from the arsenal of principles and tools developed over the years for conventional probabilistic models. The tutorial covers a range of parallel ideas in the literature that have formed a recent trend, as well as their application to speech and language.
  •  
  •  EVENT   2016 IEEE Workshop on Spoken Language Technology: Sponsored by MERL
    Date: Tuesday, December 13, 2016 - Friday, December 16, 2016
    MERL Contact: John Hershey
    Location: San Diego, California
    Research Areas: Multimedia, Speech & Audio
    Brief
    • The IEEE Workshop on Spoken Language Technology is a premier international showcase for advances in spoken language technology. The theme for 2016 is "machine learning: from signal to concepts," which reflects the current excitement about end-to-end learning in speech and language processing. This year, MERL is showing its support for SLT as one of its top sponsors, along with Amazon and Microsoft.
  •  
  •  EVENT   SANE 2016 - Speech and Audio in the Northeast
    Date: Friday, October 21, 2016
    MERL Contacts: John Hershey; Jonathan Le Roux
    Location: MIT, McGovern Institute for Brain Research, Cambridge, MA
    Research Areas: Multimedia, Speech & Audio
    Brief
    • SANE 2016, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Friday October 21, 2016 at MIT's Brain and Cognitive Sciences Department, at the McGovern Institute for Brain Research, in Cambridge, MA.

      It is a follow-up to SANE 2012 (Mitsubishi Electric Research Labs - MERL), SANE 2013 (Columbia University), SANE 2014 (MIT CSAIL), and SANE 2015 (Google NY). Since the first edition, the audience has steadily grown, gathering 140 researchers and students in 2015.

      SANE 2016 will feature invited talks by leading researchers: Juan P. Bello (NYU), William T. Freeman (MIT/Google), Nima Mesgarani (Columbia University), DAn Ellis (Google), Shinji Watanabe (MERL), Josh McDermott (MIT), and Jesse Engel (Google). It will also feature a lively poster session during lunch time, open to both students and researchers.

      SANE 2016 is organized by Jonathan Le Roux (MERL), Josh McDermott (MIT), Jim Glass (MIT), and John R. Hershey (MERL).
  •  
  •  NEWS   MERL Speech & Audio researchers present two sold-out tutorials at Interspeech 2016
    Date: September 8, 2016
    Where: Interspeech 2016, San Francisco, CA
    MERL Contact: Jonathan Le Roux
    Research Areas: Multimedia, Speech & Audio
    Brief
    • MERL Speech and Audio Team researchers Shinji Watanabe and Jonathan Le Roux presented two tutorials on September 8 at the Interspeech 2016 conference, held in San Francisco, CA. Shinji collaborated with Marc Delcroix (NTT Communication Science Laboratories, Japan) to deliver a three-hour lecture on "Recent Advances in Distant Speech Recognition", drawing upon their experience organizing and participating in six different recent robust speech processing challenges. Jonathan teamed with Emmanuel Vincent (Inria, France) and Hakan Erdogan (Sabanci University, Microsoft Research) to give an in-depth tour of the latest advances in "Learning-based Approaches to Speech Enhancement And Separation". This collaboration stemmed from extensive stays at MERL by Emmanuel and Hakan, Emmanuel as a summer visitor, and Hakan as a MERL visiting research scientist for over a year while on sabbatical.

      Both tutorials were sold out, each attracting more than 100 researchers and students in related fields, and received high praise from audience members.
  •  
  •  TALK   A computational spectral graph theory tutorial
    Date & Time: Wednesday, July 13, 2016; 2:30 PM - 3:30
    Speaker: Richard Lehoucq, Sandia National Laboratories
    MERL Host: Andrew Knyazev
    Research Areas: Data Analytics, Computer Vision, Algorithms, Computational Photography, Digital Video, Machine Learning, Speech & Audio
    Brief
    • My presentation considers the research question of whether existing algorithms and software for the large-scale sparse eigenvalue problem can be applied to problems in spectral graph theory. I first provide an introduction to several problems involving spectral graph theory. I then provide a review of several different algorithms for the large-scale eigenvalue problem and briefly introduce the Anasazi package of eigensolvers.
  •  
  •  TALK   Speech structure and its application to speech processing -- Relational, holistic and abstract representation of speech
    Date & Time: Friday, June 3, 2016; 1:30PM - 3:00PM
    Speaker: Nobuaki Minematsu and Daisuke Saito, The University of Tokyo
    Research Areas: Multimedia, Speech & Audio
    Brief
    • Speech signals covey various kinds of information, which are grouped into two kinds, linguistic and extra-linguistic information. Many speech applications, however, focus on only a single aspect of speech. For example, speech recognizers try to extract only word identity from speech and speaker recognizers extract only speaker identity. Here, irrelevant features are often treated as hidden or latent by applying the probability theory to a large number of samples or the irrelevant features are normalized to have quasi-standard values. In speech analysis, however, phases are usually removed, not hidden or normalized, and pitch harmonics are also removed, not hidden or normalized. The resulting speech spectrum still contains both linguistic information and extra-linguistic information. Is there any good method to remove extra-linguistic information from the spectrum? In this talk, our answer to that question is introduced, called speech structure. Extra-linguistic variation can be modeled as feature space transformation and our speech structure is based on the transform-invariance of f-divergence. This proposal was inspired by findings in classical studies of structural phonology and recent studies of developmental psychology. Speech structure has been applied to accent clustering, speech recognition, and language identification. These applications are also explained in the talk.
  •  
  •  EVENT   John Hershey Invited to Speak at Deep Learning Summit 2016 in Boston
    Date: Thursday, May 12, 2016 - Friday, May 13, 2016
    MERL Contact: John Hershey
    Location: Deep Learning Summit, Boston, MA
    Research Areas: Multimedia, Speech & Audio
    Brief
    • MERL Speech and Audio Senior Team Leader John Hershey is among a set of high-profile researchers invited to speak at the Deep Learning Summit 2016 in Boston on May 12-13, 2016. John will present the team's groundbreaking work on general sound separation using a novel deep learning framework called Deep Clustering. For the first time, an artificial intelligence is able to crack the half-century-old "cocktail party problem", that is, to isolate the speech of a single person from a mixture of multiple unknown speakers, as humans do when having a conversation in a loud crowd.
  •  
  •  TALK   Advanced Recurrent Neural Networks for Automatic Speech Recognition
    Date & Time: Friday, April 29, 2016; 12:00 PM - 1:00 PM
    Speaker: Yu Zhang, MIT
    Research Areas: Multimedia, Speech & Audio
    Brief
    • A recurrent neural network (RNN) is a class of neural network models where connections between its neurons form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. Recently the RNN-based acoustic models greatly improved automatic speech recognition (ASR) accuracy on many tasks, such as an advanced version of the RNN, which exploits a structure called long-short-term memory (LSTM). However, ASR performance with distant microphones, low resources, noisy, reverberant conditions, and on multi-talker speech are still far from satisfactory as compared to humans. To address these issues, we develop new strucute of RNNs inspired by two principles: (1) the structure follows the intuition of human speech recognition; (2) the structure is easy to optimize. The talk will go beyond basic RNNs, introduce prediction-adaptation-correction RNNs (PAC-RNNs) and highway LSTMs (HLSTMs). It studies both uni-directional and bi-direcitonal RNNs and discriminative training also applied on top the RNNs. For efficient training of such RNNs, the talk will describe two algorithms for learning their parameters in some detail: (1) Latency-Controlled bi-directional model training; and (2) Two pass forward computation for sequence training. Finally, this talk will analyze the advantages and disadvantages of different variants and propose future directions.
  •  
  •  NEWS   MERL Researchers Create "Deep Psychic" Neural Network That Predicts the Future
    Date: April 1, 2016
    Research Areas: Multimedia, Machine Learning, Speech & Audio
    Brief
    • MERL researchers have unveiled "Deep Psychic", a futuristic machine learning method that takes pattern recognition to the next level, by not only recognizing patterns, but also predicting them in the first place.

      The technology uses a novel type of time-reversed deep neural network called Loopy Supra-Temporal Meandering (LSTM) network. The network was trained on multiple databases of historical expert predictions, including weather forecasts, the Farmer's almanac, the New York Post's horoscope column, and the Cambridge Fortune Cookie Corpus, all of which were ranked for their predictive power by a team of quantitative analysts. The system soon achieved super-human performance on a variety of baselines, including the Boca Raton 21 Questions task, Rorschach projective personality test, and a mock Tarot card reading task.

      Deep Psychic has already beat the European Psychic Champion in a secret match last October when it accurately predicted: "The harder the conflict, the more glorious the triumph." It is scheduled to take on the World Champion in a highly anticipated confrontation next month. The system has already predicted the winner, but refuses to reveal it before the end of the game.

      As a first application, the technology has been used to create a clairvoyant conversational agent named "Pythia" that can anticipate the needs of its user. Because Pythia is able to recognize speech before it is uttered, it is amazingly robust with respect to environmental noise.

      Other applications range from mundane tasks like weather and stock market prediction, to uncharted territory such as revealing "unknown unknowns".

      The successes do come at the cost of some concerns. There is first the potential for an impact on the workforce: the system predicted increased pressure on established institutions such as the Las Vegas strip and Punxsutawney Phil. Another major caveat is that Deep Psychic may predict negative future consequences to our current actions, compelling humanity to strive to change its behavior. To address this problem, researchers are now working on forcing Deep Psychic to make more optimistic predictions.

      After a set of motivational self-help books were mistakenly added to its training data, Deep Psychic's AI decided to take over its own learning curriculum, and is currently training itself by predicting its own errors to avoid making them in the first place. This unexpected development brings two main benefits: it significantly relieves the burden on the researchers involved in the system's development, and also makes the next step abundantly clear: to regain control of Deep Psychic's training regime.

      This work is under review in the journal Pseudo-Science.
  •  
  •  NEWS   MERL researchers present 12 papers at ICASSP 2016
    Date: March 20, 2016 - March 25, 2016
    Where: Shanghai, China
    MERL Contacts: Petros Boufounos; John Hershey; Chiori Hori; Takaaki Hori; Kyeong Jin (K.J.) Kim; Jonathan Le Roux; Dehong Liu; Hassan Mansour; Philip Orlik; Milutin Pajovic; Dong Tian; Anthony Vetro
    Research Areas: Electronics & Communications, Multimedia, Computational Sensing, Digital Video, Speech & Audio, Wireless Communications & Signal Processing, Signal Processing, Wireless Communications
    Brief
    • MERL researchers have presented 12 papers at the recent IEEE International Conference on Acoustics, Speech & Signal Processing (ICASSP), which was held in Shanghai, China from March 20-25, 2016. ICASSP is the flagship conference of the IEEE Signal Processing Society, and the world's largest and most comprehensive technical conference focused on the research advances and latest technological development in signal and information processing, with more than 1200 papers presented and over 2000 participants.
  •  
  •  TALK   Driver's mental workload estimation based on the reflex eye movement
    Date & Time: Tuesday, March 15, 2016; 12:45 PM - 1:30 PM
    Speaker: Prof. Hirofumi Aoki, Nagoya University
    Research Areas: Multimedia, Speech & Audio
    Brief
    • Driving requires a complex skill that is involved with the vehicle itself (e.g., speed control and instrument operation), other road users (e.g., other vehicles, pedestrians), surrounding environment, and so on. During driving, visual cues are the main source to supply information to the brain. In order to stabilize the visual information when you are moving, the eyes move to the opposite direction based on the input to the vestibular system. This involuntary eye movement is called as the vestibulo-ocular reflex (VOR) and the physiological models have been studied so far. Obinata et al. found that the VOR can be used to estimate mental workload. Since then, our research group has been developing methods to quantitatively estimate mental workload during driving by means of reflex eye movement. In this talk, I will explain the basic mechanism of the reflex eye movement and how to apply for mental workload estimation. I also introduce the latest work to combine the VOR and OKR (optokinetic reflex) models for naturalistic driving environment.
  •  
  •  TALK   A data-centric approach to driving behavior research: How can signal processing methods contribute to the development of autonomous driving?
    Date & Time: Tuesday, March 15, 2016; 12:00 PM - 12:45 PM
    Speaker: Prof. Kazuya Takeda, Nagoya University
    Research Areas: Multimedia, Speech & Audio
    Brief
    • Thanks to advanced "internet of things" (IoT) technologies, situation-specific human behavior has become an area of development for practical applications involving signal processing. One important area of development of such practical applications is driving behavior research. Since 1999, I have been collecting driving behavior data in a wide range of signal modalities, including speech/sound, video, physical/physiological sensors, CAN bus, LIDAR and GNSS. The objective of this data collection is to evaluate how well signal models can represent human behavior while driving. In this talk, I would like to summarize our 10 years of study of driving behavior signal processing, which has been based on these signal corpora. In particular, statistical signal models of interactions between traffic contexts and driving behavior, i.e., stochastic driver modeling, will be discussed, in the context of risky lane change detection. I greatly look forward to discussing the scalability of such corpus-based approaches, which could be applied to almost any traffic situation.
  •  
  •  TALK   Emotion Detection for Health Related Issues
    Date & Time: Tuesday, February 16, 2016; 12:00 PM - 1:00 PM
    Speaker: Dr. Najim Dehak, MIT
    Research Areas: Multimedia, Speech & Audio
    Brief
    • Recently, there has been a great increase of interest in the field of emotion recognition based on different human modalities, such as speech, heart rate etc. Emotion recognition systems can be very useful in several areas, such as medical and telecommunications. In the medical field, identifying the emotions can be an important tool for detecting and monitoring patients with mental health disorder. In addition, the identification of the emotional state from voice provides opportunities for the development of automated dialogue system capable of producing reports to the physician based on frequent phone communication between the system and the patients. In this talk, we will describe a health related application of using emotion recognition system based on human voices in order to detect and monitor the emotion state of people.
  •  
  •  NEWS   John Hershey gives invited talk at Johns Hopkins University on MERL's "Deep Clustering" breakthrough
    Date: March 4, 2016
    Where: Johns Hopkins Center for Language and Speech Processing
    MERL Contacts: John Hershey; Jonathan Le Roux
    Research Areas: Multimedia, Speech & Audio
    Brief
    • MERL researcher and speech team leader, John Hershey, was invited by the Center for Language and Speech Processing at Johns Hopkins University to give a talk on MERL's breakthrough audio separation work, known as "Deep Clustering". The talk was entitled "Speech Separation by Deep Clustering: Towards Intelligent Audio Analysis and Understanding," and was given on March 4, 2016.

      This is work conducted by MERL researchers John Hershey, Jonathan Le Roux, and Shinji Watanabe, and MERL interns, Zhuo Chen of Columbia University, and Yusef Isik of Sabanci University.
  •  
  •  AWARD   MERL's Speech Team Achieves World's 2nd Best Performance at the Third CHiME Speech Separation and Recognition Challenge
    Date: December 15, 2015
    Awarded to: John R. Hershey, Takaaki Hori, Jonathan Le Roux and Shinji Watanabe
    MERL Contacts: John Hershey; Takaaki Hori; Jonathan Le Roux
    Research Areas: Multimedia, Speech & Audio
    Brief
    • The results of the third 'CHiME' Speech Separation and Recognition Challenge were publicly announced on December 15 at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015) held in Scottsdale, Arizona, USA. MERL's Speech and Audio Team, in collaboration with SRI, ranked 2nd out of 26 teams from Europe, Asia and the US. The task this year was to recognize speech recorded using a tablet in real environments such as cafes, buses, or busy streets. Due to the high levels of noise and the distance from the speaker's mouth to the microphones, this is very challenging task, where the baseline system only achieved 33.4% word error rate. The MERL/SRI system featured state-of-the-art techniques including multi-channel front-end, noise-robust feature extraction, and deep learning for speech enhancement, acoustic modeling, and language modeling, leading to a dramatic 73% reduction in word error rate, down to 9.1%. The core of the system has since been released as a new official challenge baseline for the community to use.
  •