Classification and Pose Estimation of Vehicles in Videos by 3D Modeling within Discrete-Continuous Optimization

This paper presents a framework for classification and pose estimation of vehicles in videos, assuming that their 3D models are given. We rank possible poses and types for each frame and exploit temporal coherence between consecutive frames for refinement. Our contribution is twofold. First, we cast the estimation of a vehicle's pose and type as the solution of a continuous optimization problem over space and time. Because this problem is non-convex, good starting points are important. We propose to obtain them by a discrete temporal optimization that reaches a global optimum on a ranked discrete set of possible types and poses. Second, to guarantee the effectiveness of the proposed discrete-continuous optimization, we present a novel way to efficiently reduce the search space of potential 3D model types and poses in each frame for the discrete optimizer. It avoids the common, expensive evaluation of all possible discretized hypotheses. The key to efficiency lies in a novel combination of detecting the vehicle, rendering the 3D models, matching projected edges to input images, and using a tree-structured Markov Random Field to obtain fast and globally optimal inference and to force the vehicle to follow a feasible motion model in the initial phase. Quantitative and qualitative experiments on a variety of videos with a wide variation of vehicle types show superior results to state-of-the-art methods.
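
To illustrate the discrete temporal step, the following is a minimal sketch of globally optimal min-sum inference over a chain of frames, which is the simplest special case of a tree-structured Markov Random Field. It is not the implementation used in this work: the function name `best_hypothesis_chain`, the unary costs (standing in for edge-matching scores of ranked hypotheses), and the pairwise cost (standing in for a motion-model penalty) are all assumptions chosen for illustration.

```python
def best_hypothesis_chain(unary, pairwise):
    """Pick one hypothesis per frame so that the summed cost is globally minimal.

    unary[t][k]      : cost of hypothesis k in frame t (hypothetical matching cost).
    pairwise(t, i, j): cost of going from hypothesis i in frame t to hypothesis j
                       in frame t + 1 (hypothetical motion-model penalty).
    Returns (path, total_cost), where path[t] is the chosen hypothesis index.
    """
    cost = list(unary[0])          # best cost of any path ending in each hypothesis
    backptrs = []                  # backptrs[t-1][j] = best predecessor of j in frame t
    for t in range(1, len(unary)):
        new_cost, bp = [], []
        for j, u in enumerate(unary[t]):
            # Best predecessor for hypothesis j of frame t.
            best_i = min(range(len(cost)),
                         key=lambda i: cost[i] + pairwise(t - 1, i, j))
            bp.append(best_i)
            new_cost.append(cost[best_i] + pairwise(t - 1, best_i, j) + u)
        backptrs.append(bp)
        cost = new_cost

    # Backtrack from the cheapest final hypothesis.
    last = min(range(len(cost)), key=cost.__getitem__)
    path = [last]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    path.reverse()
    return path, cost[last]


# Toy data: two hypotheses per frame (for example, two yaw bins) over three frames.
# Frame 1 slightly prefers hypothesis 1, but switching hypotheses is expensive.
toy_unary = [[0, 5], [5, 0], [0, 5]]
toy_pairwise = lambda t, i, j: 10 * abs(i - j)
path, total = best_hypothesis_chain(toy_unary, toy_pairwise)
```

In the toy data, choosing the cheapest hypothesis in each frame independently would flip between hypotheses, whereas the temporal term keeps the selection consistent. The selected sequence could then serve as the starting point for a continuous refinement of the pose parameters.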