TALK    Learning Intermediate-Level Representations of Form and Motion from Natural Movies

Date released: February 22, 2012


  • Date & Time:

    Wednesday, February 22, 2012; 11:00 AM

  • Abstract:

    The human visual system processes complex patterns of light into a rich visual representation where the objects and motions of our world are made explicit. This remarkable feat is performed through a hierarchically arranged series of cortical areas. Little is known about the details of the representations in the intermediate visual areas. Therefore, we ask the question: can we predict the detailed structure of the representations we might find in intermediate visual areas?

    In pursuit of this question, I will present a model of intermediate-level visual representation that learns invariances from movies of the natural environment and produces predictions about intermediate visual areas. The model is composed of two stages of processing: an early feature representation layer, and a second layer in which invariances are explicitly represented. Invariances are learned by factoring apart the temporally stable and dynamic components embedded in the early feature representation. The structure contained in these components is made explicit in the activities of second-layer units that capture invariances in both form and motion. When trained on natural movies, the first layer produces a factorization, or separation, of image content into a temporally persistent part representing local edge structure and a dynamic part representing local motion structure. The second-layer units are split into two populations according to the factorization in the first layer. The form-selective units receive their input from the temporally persistent part (local edge structure) and, after training, form a diverse set of higher-order shape features consisting of extended contours, multi-scale edges, textures, and texture boundaries. The motion-selective units receive their input from the dynamic part (local motion structure) and, after training, represent image translation over different spatial scales and directions, in addition to more complex deformations. These representations provide a rich description of dynamic natural images, offer testable hypotheses regarding intermediate-level representation in visual cortex, and may be useful for artificial visual systems.
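
    The separation into a temporally persistent part and a dynamic part can be illustrated with a small toy example. The abstract does not say how the first layer encodes its features, so the sketch below assumes, purely for illustration, that each first-layer feature is a complex-valued coefficient: its amplitude plays the role of the persistent (form) part and its frame-to-frame phase advance plays the role of the dynamic (motion) part. The function name, the drifting-grating input, and the filter `w` are all hypothetical choices and are not taken from the talk.

```python
import numpy as np

def factor_form_and_motion(z):
    """Split complex coefficients z[t, k] into amplitude and phase advance.

    Illustrative assumption: amplitude stands in for the temporally
    persistent part, phase advance for the dynamic part.
    """
    amplitude = np.abs(z)
    # Phase change between consecutive frames, wrapped to (-pi, pi].
    phase_step = np.angle(z[1:] * np.conj(z[:-1]))
    return amplitude, phase_step

# Toy input: a 1-D grating drifting at constant speed (hypothetical data).
n_pixels, n_frames = 64, 20
x = np.arange(n_pixels)
freq = 2 * np.pi * 4 / n_pixels      # 4 cycles across the patch
shift_per_frame = 0.5                # pixels per frame
frames = np.stack(
    [np.cos(freq * (x - shift_per_frame * t)) for t in range(n_frames)]
)

# Hypothetical complex filter tuned to the same spatial frequency.
w = np.exp(1j * freq * x) / n_pixels
z = frames @ np.conj(w)[:, None]     # shape (n_frames, 1)

amplitude, phase_step = factor_form_and_motion(z)
print(amplitude.std())               # close to zero: amplitude is stable over time
print(phase_step.mean())             # constant phase advance tracks the drift
```

    In this toy case the amplitude stays fixed while the grating moves, and the phase advance is constant and proportional to the drift speed, which is the kind of stable-versus-dynamic split the abstract describes.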

  • Speaker:

    Dr. Charles Cadieu
    McGovern Institute for Brain Research, MIT

    Charles received his BS and MEng in EECS from MIT and conducted research in computational neuroscience at CBCL under Tomaso Poggio. He then went to UC Berkeley for his PhD in Neuroscience, working at the Redwood Center for Theoretical Neuroscience. He has worked on models of biological visual processing, visual representation for computer vision, and statistical techniques for understanding complex neural phenomena. He has also played key roles in a number of companies, including IQ Engines, a startup working to bring visual search to mobile phones.

  • MERL Host:

    Jonathan Le Roux

  • Research Area:

    Speech & Audio