Mining features for sequence classification
We have applied data mining techniques to the task of feature selection in order to improve the performance of classification algorithms on sequential examples, such as text or DNA sequences. In the past, it has been difficult to apply classification algorithms to sequential examples because of the vast number of potentially useful features for describing each example. We have adapted data mining algorithms to act as a preprocessor for classification algorithms. Our data mining algorithms search through billions of features and select the ones that are most useful for classification. The rules found by data mining are converted into a set of annotations, which are used to enrich the description of the examples to be classified by the machine learning algorithm.
Background & Objective: This work was originally motivated by the task of monitoring the execution of plans or schedules in order to predict failures before they arise. In this case, there are many features for describing each event and thus an exponential number of features for describing sequences of events. Our approach allows us to apply machine learning algorithms to the task of monitoring temporal processes.
Technical Discussion: Some classification algorithms work well when there are thousands of features for describing each example. In some domains, however, the number of potentially useful features is exponential in the size of the examples. Data mining algorithms have been used to search through billions of rules, or patterns, and select the most interesting ones. We have adapted data mining algorithms to act as a preprocessor that constructs a set of features to use for classification. Each pattern produced by our system, FeatureMine, is used to create a new boolean feature for describing each example. If the pattern holds in the example, the feature's value is true; otherwise it is false. We train a standard classification algorithm, such as Winnow or Naive Bayes, on the enriched examples. Our experiments show that the features produced by FeatureMine improve classification accuracy by 10-50% on several challenging problem sets.
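The enrichment step described above can be illustrated with a minimal Python sketch. It is not FeatureMine itself: the pattern language and the mining procedure are not specified here, so the sketch assumes that a pattern is an ordered subsequence (gaps allowed) and uses a small hand-picked pattern list in place of mined patterns. The function names (holds, enrich, train_winnow, predict), the toy DNA strings, and the Winnow settings (promotion factor alpha = 2, threshold n/2) are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def holds(pattern: Sequence[str], example: Sequence[str]) -> bool:
    """True if the items of `pattern` appear in `example` in order (gaps allowed).

    Assumption: "the pattern holds" means ordered-subsequence containment.
    """
    remaining = iter(example)
    # `item in remaining` advances the iterator past the first match,
    # so each later item must occur strictly after the earlier ones.
    return all(item in remaining for item in pattern)


def enrich(example: Sequence[str], patterns: List[Sequence[str]]) -> List[bool]:
    """One boolean feature per pattern: does the pattern hold in the example?"""
    return [holds(p, example) for p in patterns]


def predict(weights: List[float], theta: float, row: List[bool]) -> bool:
    """Winnow prediction: positive if the active features' weights reach theta."""
    return sum(w for w, on in zip(weights, row) if on) >= theta


def train_winnow(
    rows: List[List[bool]],
    labels: List[bool],
    alpha: float = 2.0,   # assumed promotion/demotion factor
    epochs: int = 10,     # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Basic Winnow: multiplicative updates on the active features after a mistake."""
    n = len(rows[0])
    weights = [1.0] * n
    theta = n / 2  # assumed threshold
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            if predict(weights, theta, row) == label:
                continue
            # Missed positive: promote active weights. False positive: demote them.
            factor = alpha if label else 1.0 / alpha
            weights = [w * factor if on else w for w, on in zip(weights, row)]
    return weights, theta


# Hypothetical toy data: hand-picked patterns stand in for mined ones.
PATTERNS = ["AT", "GG", "C"]
EXAMPLES = ["ACGT", "GGCA", "ATTG", "CGGC"]
LABELS = [True, False, True, False]

ROWS = [enrich(e, PATTERNS) for e in EXAMPLES]
WEIGHTS, THETA = train_winnow(ROWS, LABELS)
```

In this toy setup the classifier never sees the raw sequences, only the boolean vector produced by enrich. That mirrors the preprocessor role described above: the choice of patterns determines which sequence properties the learner can use.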
Outside Collaborations: This work is being done in conjunction with the University of Rochester and Rensselaer Polytechnic Institute.
Contact: Joseph Katz
Technology Area: Artificial Intelligence
Modification Date: September 12, 2007

