TR2022-082

Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio


    •  Chatterjee, M., Ahuja, N., Cherian, A., "Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
      @inproceedings{Chatterjee2022jun,
        author    = {Chatterjee, Moitreya and Ahuja, Narendra and Cherian, Anoop},
        title     = {Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio},
        booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
        year      = {2022},
        month     = jun,
        url       = {https://www.merl.com/publications/TR2022-082}
      }
Research Areas: Artificial Intelligence, Computer Vision, Machine Learning, Speech & Audio

Abstract:

In this paper, we study the problem of synthesizing video frames from the accompanying audio and a few past frames – a task with immense potential, e.g., in occlusion reasoning. Prior methods for this problem typically train deep learning models whose training signal is the mean-squared error (MSE) between the generated frame and the ground truth. However, these techniques do not account for the predictive uncertainty of the frame generation model, which can lead to sub-optimal training, especially when this uncertainty is high. To address this challenge, we introduce the Predictive Uncertainty Quantifier (PUQ), a stochastic quantification of the generative model's predictive uncertainty that is then used to weight the MSE loss. PUQ is derived from a hierarchical, variational deep net and is easy to implement and incorporate into audio-conditioned stochastic frame generation methods. Experiments on two challenging datasets demonstrate that our method converges faster and to better solutions than competing baselines.
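
The exact form of PUQ is developed in the full paper; as a minimal, non-authoritative sketch of the general idea of weighting the MSE by predictive uncertainty, the standard heteroscedastic loss below down-weights the squared error on frames where a hypothetical log-variance head of the generator reports high uncertainty. All names and tensor shapes here are illustrative assumptions, not the paper's implementation:

    import torch

    def uncertainty_weighted_mse(pred, target, log_var):
        """Heteroscedastic MSE sketch: frames with higher predicted
        uncertainty contribute less to the loss, while the log-variance
        penalty keeps the model from inflating its uncertainty just to
        suppress the error term.

        pred, target: generated and ground-truth frames, (B, C, H, W).
        log_var:      log predictive variance, e.g. one scalar per frame
                      from an assumed auxiliary head of the generator.
        """
        inv_var = torch.exp(-log_var)                    # 1 / sigma^2
        weighted_sq_err = inv_var * (pred - target) ** 2
        return (0.5 * weighted_sq_err + 0.5 * log_var).mean()

    # Illustrative usage with random tensors: a batch of 4 RGB frames
    # at 64x64, with one log-variance estimate per frame.
    pred = torch.randn(4, 3, 64, 64)
    target = torch.randn(4, 3, 64, 64)
    log_var = torch.zeros(4, 1, 1, 1)
    print(uncertainty_weighted_mse(pred, target, log_var))

Note that this sketch uses a simple per-frame regression head for the variance, whereas the paper derives PUQ from a hierarchical variational model; only the loss-weighting principle is shared.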