TR2018-137

Recurrent Multi-frame Single Shot Detector for Video Object Detection



Deep convolutional neural networks (CNNs) have recently proven extremely capable of performing object detection in single-frame images. In this work, we extend a class of CNNs designed for static-image object detection to multi-frame video object detection. Our Multi-frame Single Shot Detector (Mf-SSD) augments the Single Shot Detector (SSD) meta-architecture [16] to incorporate temporal information from video data. By adding a convolutional recurrent layer to an SSD architecture, our model fuses features across multiple frames and exploits the additional spatial and temporal information available in video data to improve the overall accuracy of the object detector. Our solution uses a fully convolutional network architecture to retain the impressive speed of single-frame SSDs. In particular, our Recurrent Mf-SSD (RMf-SSD, based on the SqueezeNet+ architecture) can perform object detection at 50 frames per second (FPS). This model improves upon a state-of-the-art SSD model by 2.7 percentage points in mean average precision (mAP) on the challenging KITTI dataset. Additional experiments demonstrate that our Mf-SSD can incorporate a wide range of underlying network architectures and that, in each case, the multi-frame model consistently improves upon single-frame baselines. We further validate the efficacy of our RMf-SSD on the Caltech Pedestrians dataset and find a similar improvement of 5% on the pre-defined Reasonable set.
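The core idea described above, inserting a convolutional recurrent layer between an SSD backbone and its detection heads so that features are fused across frames, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the tiny backbone, the ConvGRU cell, and all layer sizes (`feat_ch`, `num_anchors`, `num_classes`) are hypothetical stand-ins for the paper's SqueezeNet+ backbone and actual recurrent layer.

```python
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: fuses the current frame's feature map
    with a hidden state carried over from previous frames (hypothetical
    choice of recurrent unit for illustration)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # update and reset gates, computed from [input, hidden]
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        # candidate hidden state
        self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x, h):
        if h is None:
            h = torch.zeros_like(x)  # first frame: no temporal context yet
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde


class RecurrentMfSSD(nn.Module):
    """Toy recurrent multi-frame SSD: per-frame backbone features pass
    through the ConvGRU before the convolutional box/class heads, so the
    heads see temporally fused features. Fully convolutional throughout."""

    def __init__(self, in_ch=3, feat_ch=32, num_anchors=4, num_classes=3):
        super().__init__()
        # stand-in for the SqueezeNet+-style backbone used in the paper
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.rnn = ConvGRUCell(feat_ch)
        # SSD-style convolutional prediction heads
        self.cls_head = nn.Conv2d(feat_ch, num_anchors * num_classes, 3, padding=1)
        self.box_head = nn.Conv2d(feat_ch, num_anchors * 4, 3, padding=1)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        h = None
        for t in range(frames.size(1)):
            h = self.rnn(self.backbone(frames[:, t]), h)
        # predict from the temporally fused feature map of the last frame
        return self.cls_head(h), self.box_head(h)


# usage sketch: a batch of 2 clips, 4 frames each, 64x64 RGB
model = RecurrentMfSSD()
cls_scores, box_offsets = model(torch.randn(2, 4, 3, 64, 64))
```

Because the recurrent layer operates on feature maps rather than flattened vectors, the network stays fully convolutional, which is what lets the multi-frame model keep single-frame SSD inference speed.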