Learning Concise and Minimally Uncertain Models

Entropic estimation is a framework for simultaneously estimating the structure and parameters of a probabilistic model. It is a continuous formalization of Occam's razor: Seek the smallest and most unambiguous model that can explain the data. The resulting models are more predictive, more general, and much more interpretable than models obtained from conventional learning methods. Entropic estimation often induces a model quite close to the mechanism that generated the signal.

Background & Objective:  To tune in a radio station, you get a machine-a radio-and you vary a parameter-its frequency-until the machine has a good fit with the signal you want. Computer "learning" is very similar: The computer gets a machine-usually a complex statistical model-and programmatically varies hundreds or thousands of parameters until the model fits the signal. There is a catch: An expert must design a model that matches the gross structure of the signal. This means laborious trial-and-error, sometimes without success. We seek efficient methods to simultaneously design the machine and find good parameter settings.

Technical Discussion:  Entropy is a measure of disorder, spread, and uncertainty. We learn by minimizing three entropies, one assessed on the model, one assessed on the model's picture of the data, and one assessed on aspects of the data not captured by the model. There is an intuitive probabilistic interpretation in terms of maximizing the posterior given by Bayes' rule, which measures our confidence in a model after having seen some data:
    P(model given data) = P(data given model) . P(model) / P(data)
The likelihood P(data given model) is a known function f(X,T) of data X parameters T. The prior P(model) is usually unknown. Entropic estimation gives us a universally applicable prior,
    P(model) ~ exp(-H(model))
where  H() measures the entropy, or uncertainty, of the model. The prior summarizes our background knowledge--this particular prior can be derived from the simple assertion, "This task is learnable." It is also a bias for small, unambiguous, and highly structured models.
    An estimator yields parameter values that maximize the posterior. We have derived exact solutions for estimators for a large variety of likelihood functions--even when the problem leads to systems of transcendental equations. These yield very fast learning algorithms that sculpt theories out of overcomplete random models by extinguishing excess and inappropriate parameters, thereby removing terms from the likelihood function f(X,T). Variants on the framework give trimming criteria that tell us when the model can be simplified by removing parts of the likelihood function, and deterministic annealing procedures that allow us to avoid suboptimal solutions. Because entropy minimization will find hidden structures in the data, it can be thought of as automated exploratory science, discovering previously unknown hidden causes that explain our observed world.

Contact:  Matthew Brand

Technology Areas:
Artificial Intelligence
Graphics

Modification Date:  June 26, 2001