Mitsubishi Electric Research Laboratories

Broadcast Video Content Segmentation by Supervised Learning

Citation: Wilson, K.W.; Divakaran, A., "Broadcast Video Content Segmentation by Supervised Learning", Multimedia Content Analysis, ISBN: 978-0-387-76569-3, pp. 1-17, March 2009 (SpringerLink)
MERL Report: TR2008-091
MERL Contact: Kevin W. Wilson


Typical summarization system framework. Details and emphasis vary, but most previous summarization systems take in audiovisual content and extract low-level features. They then find temporal patterns, typically by clustering on feature similarity or segmenting based on feature coherence. From these patterns, estimates of semantic structure, such as shot or scene change locations, are made. Finally, the results are presented in a user-friendly format, such as a set of informative keyframes or a short video skim.
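
As a rough illustration of the generic framework described in this caption (not the supervised method this chapter presents), the stages might be sketched as follows. The sketch assumes each frame has already been reduced to a normalized color histogram; the function names, the L1 distance, and the threshold value are hypothetical choices made only for this example.

```python
from typing import List, Sequence

Histogram = Sequence[float]  # assumed: one normalized color histogram per frame


def histogram_distance(a: Histogram, b: Histogram) -> float:
    """L1 distance between two histograms (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def segment_by_coherence(features: List[Histogram], threshold: float = 0.5) -> List[int]:
    """Return the frame indices at which a new segment starts.

    A boundary is declared wherever consecutive frames differ by more
    than `threshold` (an illustrative value, not one from the chapter).
    """
    if not features:
        return []
    boundaries = [0]
    for i in range(1, len(features)):
        if histogram_distance(features[i - 1], features[i]) > threshold:
            boundaries.append(i)
    return boundaries


def pick_keyframes(boundaries: List[int], num_frames: int) -> List[int]:
    """Pick the middle frame of each segment as its keyframe."""
    ends = boundaries[1:] + [num_frames]
    return [(start + end - 1) // 2 for start, end in zip(boundaries, ends)]


# Toy input: three visually distinct runs of frames (3, 4, and 2 frames long).
frames = [[0.9, 0.1, 0.0]] * 3 + [[0.1, 0.1, 0.8]] * 4 + [[0.3, 0.6, 0.1]] * 2

boundaries = segment_by_coherence(frames)
keyframes = pick_keyframes(boundaries, len(frames))
print(boundaries)  # [0, 3, 7]
print(keyframes)   # [1, 4, 7]
```

A fixed threshold on a single low-level feature is exactly the kind of hand-tuned rule that tends to behave differently from one genre to the next, which is part of the motivation for learning the boundary decision from labeled examples instead.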

Today's viewers are presented with huge amounts of content from broadcast, cable, pay-per-view, Internet streaming, and other sources. An expanding array of display devices and viewing environments further motivates the need for video summarization, rapid navigation, and management tools. However, most video summarization goals are stated in semantic terms ("the most informative summary", "the most exciting plays of the match"), while our computational tools are best at extracting simple features like audio energy and color histograms. This chapter presents our supervised learning approach to bridging this "semantic gap": we use hand-labeled examples to locate all the scene changes within content, in a way that works across a broad range of genres, including news, situation comedies, dramas, how-to shows, and more. We believe this is a useful and semantically meaningful goal that can serve as a building block in a variety of higher-level video summarization systems.
