Bilinear Models for Style-Adaptive Recognition
We have developed a new style-adaptive recognition method that may be useful for speech, face, or character recognition. On several speech and face recognition tasks, this method outperforms traditional methods. On one benchmark test involving spoken vowels, our method achieved 76% correct recognition, while the best traditional approach achieved 56%.
Background & Objective: Many learning problems call for recognition, classification, or synthesis of data generated through the interaction of multiple independent factors. An optical character recognition (OCR) system may be required to recognize a standard set of characters in new fonts. A speech analysis system may be required to classify the utterances of many new speakers into known word categories. An adaptive controller may be required to produce learned state-space trajectories under novel payload conditions. Each of these problems decomposes naturally into two factors, which might be thought of intuitively as content and style (e.g. "letter" & "font", "word" & "speaker", "trajectory" & "load"), and each factor includes many possible elements (e.g. "letter" includes "A", "B", "C", etc.; "font" includes "Times", "Helvetica", "Courier", etc.).
Technical Discussion: We introduce a class of stochastic models, called Separable Mixture Models (SMMs). SMMs combine an expressive representation of factor interactions as linear transformations with simple and efficient learning procedures based on the Expectation-Maximization (EM) algorithm. We demonstrate the usefulness of SMMs for recognition, classification, and synthesis tasks with applications to speech and typography. In these applications, the system learns separate models for style (e.g. speaker or font) and content (e.g. vowel or letter), in order to classify style and content independently of each other. Our results on vowel classification demonstrate that models which incorporate speaker identity during training can lead to substantial improvements, even when little or no information about speaker identity is available during testing with new speakers. This task is difficult largely because only a small number of styles are represented in the training data, and the new styles encountered during testing are somewhat different from these. Under these general conditions, we expect the strategy of building explicit (and independent) style models and content models to be a promising approach to generalizing content classification across new styles.
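As a rough illustration of the style/content factorization described above, the following Python sketch renders an observation as a style-specific linear transformation applied to a content vector, then labels a noisy observation with the (style, content) pair whose rendering is closest. All names and numeric values here are hypothetical, the parameters are hand-picked rather than learned, and the EM-based fitting used by SMMs is omitted entirely.

```python
# Illustrative sketch only: hand-picked parameters, not a learned SMM.

def render(style_matrix, content_vector):
    """Observation = style-specific linear map applied to a content vector."""
    return [sum(a * b for a, b in zip(row, content_vector)) for row in style_matrix]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Hypothetical content vectors (e.g. two vowels) in a 2-D content space.
CONTENT = {"vowel_a": [1.0, 0.0], "vowel_i": [0.0, 1.0]}

# Hypothetical style matrices (e.g. two speakers), each mapping the
# 2-D content space into a 3-D observation space.
STYLE = {
    "speaker_1": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "speaker_2": [[2.0, 0.5], [0.5, 2.0], [1.0, -1.0]],
}


def classify(observation):
    """Return the (style, content) pair whose rendering is nearest the observation."""
    pairs = ((s, c) for s in STYLE for c in CONTENT)
    return min(
        pairs,
        key=lambda sc: sq_dist(render(STYLE[sc[0]], CONTENT[sc[1]]), observation),
    )


# A noisy observation close to speaker_2 producing vowel_i.
noisy = [0.45, 2.1, -0.9]
style, content = classify(noisy)
print(style, content)
```

Because the content vectors are shared across styles, a new style would only require a new style matrix, which is the property that makes this kind of factorization attractive for generalizing to new speakers or fonts.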
Contacts:
Joseph Katz
Technical Reports:
Learning Bilinear Models for Two-Factor Problems in Vision
Separating style and content
Technology Area: Artificial Intelligence
Modification Date: January 23, 2007

