(Learn more about the MERL Seminar Series.)
Date & Time:
Tuesday, September 28, 2021; 1:00 PM EST
While computer vision has made significant progress by "looking" — detecting objects, actions, or people based on their appearance — it often does not listen. Yet cognitive science tells us that perception develops by making use of all our senses, without intensive supervision. Toward this goal, in this talk I will present my research on audio-visual learning: we disentangle object sounds from unlabeled video, use audio as an efficient preview for action recognition in untrimmed video, decode a monaural soundtrack into its binaural counterpart by injecting visual spatial information, and use echoes to interact with the environment for spatial image representation learning. Together, these are steps toward a multimodal understanding of the visual world, in which audio serves as both a semantic and a spatial signal. At the end, I will also briefly discuss our latest work on multisensory learning for robotics.
Dr. Ruohan Gao
Ruohan Gao is a Postdoctoral Fellow in the Computer Science Department at Stanford University. He obtained his Ph.D. at The University of Texas at Austin and his B.Eng. at The Chinese University of Hong Kong. Ruohan works in the fields of computer vision and machine learning, with particular interests in audio-visual learning from videos and embodied learning with multiple modalities. His research has been recognized by the 2021 Michael H. Granof University Best Dissertation Award, the UT Austin Outstanding Dissertation Award (2021), the Google PhD Fellowship (2019-2021), the Adobe Research Fellowship (2019), and a Best Paper Award Finalist selection at the Conference on Computer Vision and Pattern Recognition (CVPR) 2019 for his work on 2.5D visual sound.