(Learn more about the MERL Seminar Series.)
Date & Time:
Tuesday, March 1, 2022; 1:00 PM EST
Humans learn spoken language and visual perception at an early age by being immersed in the world around them. Why can't computers do the same? In this talk, I will describe our ongoing work to develop methodologies for grounding continuous speech signals at the raw waveform level to natural image scenes. I will first present self-supervised models capable of discovering discrete, hierarchical structure (words and sub-word units) in the speech signal. Instead of conventional annotations, these models learn from correspondences between speech sounds and visual patterns such as objects and textures. Next, I will demonstrate how these discrete units can be used as a drop-in replacement for text transcriptions in an image captioning system, enabling us to directly synthesize spoken descriptions of images without the need for text as an intermediate representation. Finally, I will describe our latest work on Transformer-based models of visually-grounded speech. These models significantly outperform the prior state of the art on semantic speech-to-image retrieval tasks, and also learn representations that are useful for a multitude of other speech processing tasks.
The University of Texas at Austin
David Harwath is an assistant professor in the computer science department at UT Austin. His research focuses on multimodal, self-supervised learning algorithms for speech, audio, vision, and text. Under the supervision of James Glass, his doctoral thesis introduced models for the joint perception of speech and vision. This work was awarded the 2018 George M. Sprowls Award for the best computer science PhD thesis at MIT. He holds a B.S. in electrical engineering from UIUC (2010), an S.M. in computer science from MIT (2013), and a Ph.D. in computer science from MIT (2018).