Body Part Alignment and Temporal Attention for Video-Based Person Re-Identification


We present a novel deep neural network for video-based person re-identification that is designed to address two of the major issues that make this problem difficult. The first is misalignment between cropped images of people. To handle it, we use the OpenPose network [2] to localize different body parts so that corresponding regions of feature maps can be compared. The second is bad frames in a video sequence, typically frames in which the person is occluded, poorly localized, or badly blurred. To handle these, we design a temporal attention network that analyzes the feature maps of multiple frames and assigns a weight to each frame, so that more useful frames contribute more to the aggregated feature vector representing the entire sequence. The resulting network improves over state-of-the-art results on all three standard test sets for video-based person re-id (PRID2011 [8], iLIDS-VID [21] and MARS [27]).
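The frame-weighting idea can be illustrated with a minimal sketch. It assumes each frame already has a feature vector and a scalar quality score, and it normalizes the scores with a softmax. The function name `temporal_attention_pool`, the softmax normalization, and the hand-picked scores are illustrative assumptions. In the network described above the weights are produced by a learned attention module operating on feature maps, and the exact formulation is not given here.

```python
import math


def temporal_attention_pool(frame_features, frame_scores):
    """Aggregate per-frame feature vectors into one sequence-level vector.

    frame_features: list of equal-length lists of floats, one per frame.
    frame_scores:   list of floats, one unnormalized score per frame
                    (hypothetical stand-in for a learned attention output).
    Returns (pooled_vector, weights).
    """
    if not frame_features or len(frame_features) != len(frame_scores):
        raise ValueError("need one score per frame and at least one frame")

    # Softmax over frames (assumed normalization); subtracting the max
    # keeps the exponentials numerically stable.
    max_score = max(frame_scores)
    exps = [math.exp(s - max_score) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frame features, one dimension at a time.
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


# Toy sequence: the third frame stands in for a "bad" frame (low score).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 2.0, -5.0]
pooled, weights = temporal_attention_pool(features, scores)
```

With these toy scores, the third frame receives a much smaller weight than the first two, so the pooled vector is dominated by the first two frames. Equal scores would reduce the operation to plain average pooling.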