Quantifying Predictive Uncertainty for Stochastic Video Synthesis from Audio


In this paper, we study the problem of synthesizing video frames from the accompanying audio and a few past frames, a task with significant potential, e.g., in occlusion reasoning. Prior methods for this problem typically train deep learning models whose training signal is the mean-squared error (MSE) between the generated frame and the ground truth. However, these techniques do not account for the predictive uncertainty of the frame generation model, which can lead to sub-optimal training, especially when this uncertainty is high. To address this challenge, we introduce the Predictive Uncertainty Quantifier (PUQ), a stochastic quantification of the generative model's predictive uncertainty, which is then used to weight the MSE loss. PUQ is derived from a hierarchical, variational deep network and is easy to implement and incorporate into audio-conditioned stochastic frame generation methods. Experiments on two challenging datasets demonstrate faster and improved convergence over competing baselines.
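As a rough illustration of the core idea (not the paper's exact formulation, which derives the uncertainty term from a hierarchical variational network), an uncertainty-weighted MSE in the style of heteroscedastic regression can be sketched as follows; the per-pixel `log_var` output is a hypothetical stand-in for the PUQ term:

```python
import numpy as np

def uncertainty_weighted_mse(pred, target, log_var):
    """Per-pixel squared error down-weighted where predicted uncertainty
    is high, plus a log-variance regularizer that discourages the model
    from declaring everything uncertain. `log_var` is a hypothetical
    per-pixel log-variance head; PUQ itself is a stochastic quantity
    derived from a hierarchical variational deep net."""
    se = (pred - target) ** 2
    return np.mean(np.exp(-log_var) * se + log_var)

# Toy check: with log_var = 0 the loss reduces to plain MSE.
pred = np.zeros((2, 4, 4))
target = np.ones((2, 4, 4))
log_var = np.zeros((2, 4, 4))
loss = uncertainty_weighted_mse(pred, target, log_var)
```

With `log_var = 0` everywhere, the weighting factor `exp(-log_var)` is 1 and the regularizer vanishes, so the loss equals the ordinary MSE; raising `log_var` in uncertain regions reduces their contribution to the training signal.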