March 26, 2014.

Paper published: Occam's Razor in sensorimotor learning

Genewein T, Braun DA (2014) Occam's Razor in sensorimotor learning. Proceedings of the Royal Society B 281:20132952. doi: 10.1098/rspb.2013.2952


Overview

Bayesian model selection
Bayesian Occam’s razor
Experimental design - marginal likelihood of Gaussian processes
Results

Our paper on sensorimotor regression and the preference for simpler models was published in Proceedings of the Royal Society B.

In the paper we present a sensorimotor regression paradigm that allows for testing whether humans act in line with Occam’s razor and prefer the simpler explanation if two models explain the data equally well.

Bayesian model selection

We modeled human choice behavior with Bayesian model selection. In Bayesian model selection, models $M_1$ and $M_2$ are compared by taking their posterior probability ratio (given the data $D$)

$\underbrace{\frac{P(M_1|D)}{P(M_2|D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D|M_1)}{P(D|M_2)}}_{\text{Bayes factor}} \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}},$

where the model evidence or marginal likelihood is given by marginalizing over the model parameters $\theta$

$P(D|M_i)=\int P(D|\theta,M_i)P(\theta|M_i) d\theta.$

If the posterior odds are larger than one, model $M_1$ should be preferred - if the posterior odds are smaller than one, the data favor $M_2$. If the strength of preference is ignored, this rule leads to a deterministic model selection mechanism. Additionally, the magnitude of the posterior odds indicates the strength of the preference, which can be used to derive stochastic model selection mechanisms (for instance by combining it with a softmax selection rule). In case of equal prior probabilities for both models, the quantity that governs model selection is the Bayes factor (sometimes called Occam factor), which is the ratio of marginal likelihoods.
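The two selection rules described above can be sketched in a few lines of Python. This is only an illustration and not the analysis code from the paper: the function names and the inverse-temperature parameter `beta` are assumptions made for this example. For two models, a softmax over the log posteriors reduces to a logistic function of the log posterior odds.

```python
import math


def posterior_log_odds(log_ml_1, log_ml_2, prior_1=0.5, prior_2=0.5):
    """Log posterior odds = log Bayes factor + log prior odds."""
    log_bayes_factor = log_ml_1 - log_ml_2
    log_prior_odds = math.log(prior_1) - math.log(prior_2)
    return log_bayes_factor + log_prior_odds


def choose_deterministic(log_odds):
    """Pick M1 if the posterior odds exceed one (log odds > 0), else M2."""
    return "M1" if log_odds > 0 else "M2"


def choice_probability_m1(log_odds, beta=1.0):
    """Softmax choice rule for two models (beta is a hypothetical
    inverse temperature; beta = 0 gives random choice)."""
    z = beta * log_odds
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


# Made-up log marginal likelihoods, equal priors.
log_odds = posterior_log_odds(-10.0, -12.0)
print(choose_deterministic(log_odds))   # M1
print(choice_probability_m1(log_odds))  # about 0.88
```

With equal priors the log prior odds vanish, so the choice depends only on the log Bayes factor, as stated above.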

Bayesian Occam’s razor

It is known that Bayesian model selection embodies Occam’s razor, that is, if two models explain the data equally well, the simpler model is preferred. Below are two intuitions why this happens - for a more rigorous discussion see Chapter 28 in David MacKay’s Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003).

The marginal likelihood $P(D\vert M_i)$ is obtained by considering the likelihood of the parameters given the data $P(D\vert \theta,M_i)$ under all possible parameter-settings (by taking the integral, see above). A very complex model is highly likely to have a particular parameter-setting that explains the data very well and leads to a large value of the likelihood function (think of polynomial regression with polynomials of a high degree). However, because of the marginalization, all parameter-settings have to be taken into account, and a very complex model will also contain many parameter-settings that yield a very low likelihood. In contrast, an overly simple model will on average explain the data neither very badly nor very well, leading to a low marginal likelihood as well. Only models that are complex enough but not overly complex will get a large marginal likelihood score. Thus Bayesian model selection is intrinsically regularized.
A complex model can explain many data-sets and must thus spread its probability mass over a quite large range (or volume). In contrast, a simple model can only explain few data-sets and has its probability mass more concentrated. If observed data happens to fall in a region where both models can explain the data, the simpler model is very likely to have more probability mass in that region and will thus be favored.
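Both intuitions can be checked on a toy problem where the marginalization has a closed form. The sketch below is not taken from the paper. It assumes a single observation $y = \theta + \text{noise}$ with a Gaussian prior on $\theta$, and the "simple" and "complex" models differ only in how wide that prior is (the widths 1 and 10 are arbitrary choices for illustration). Integrating out $\theta$ gives $y \sim \mathcal{N}(0, \text{prior variance} + \text{noise variance})$.

```python
import math


def log_marginal_likelihood(y, prior_std, noise_std=1.0):
    """Log evidence of one observation y under the toy model
    theta ~ N(0, prior_std^2), y | theta ~ N(theta, noise_std^2).
    Marginalizing over theta yields y ~ N(0, prior_std^2 + noise_std^2)."""
    var = prior_std ** 2 + noise_std ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * y ** 2 / var


# Hypothetical prior widths: the complex model spreads its mass widely.
SIMPLE_PRIOR_STD = 1.0
COMPLEX_PRIOR_STD = 10.0


def log_bayes_factor(y):
    """Positive values favor the simple model, negative the complex one."""
    return (log_marginal_likelihood(y, SIMPLE_PRIOR_STD)
            - log_marginal_likelihood(y, COMPLEX_PRIOR_STD))


for y in (0.5, 8.0):
    winner = "simple" if log_bayes_factor(y) > 0 else "complex"
    print(f"y = {y}: log Bayes factor = {log_bayes_factor(y):.2f} -> {winner}")
```

An observation near zero can be explained by both models, and the simple model wins because its probability mass is more concentrated there. An observation far from zero can only be explained by the complex model, which then wins despite its spread.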

Experimental design - marginal likelihood of Gaussian processes

In our experiment we test whether humans follow Occam’s razor and prefer the simpler model if two models explain the data equally well. We translated the classical regression problem of finding a curve underlying noisy observations into a sensorimotor task: in our experiment participants would see some dots that represent noisy observations of an underlying curve. Their task was to draw their best guess of the underlying curve. In training trials, after participants had drawn their curve, they were shown the true underlying curve that generated the noisy observations. Importantly, the underlying curve could only be generated by one of two models - a simple model that leads to smooth curves and a complex model that leads to more wiggly curves. In training trials, participants were informed at the start of the trial about the generative model with a color cue. In test trials we showed participants the same kind of stimulus (noisy observations) but with a neutral color cue that did not indicate the underlying model. We asked participants to draw their best guess of the underlying function, informing them that the generative model could only be one of the two models experienced previously. From their drawing we could infer their model choice and could thus test human behavior against theoretical choice behavior modeled with Bayesian model selection.

The two underlying models that generated the curves were Gaussian processes (GPs) with a squared-exponential kernel with either a long or a short length-scale (corresponding to a simple or complex model respectively). We found Gaussian processes very suitable for generating natural trajectories (which is much more difficult with polynomials for instance) but perhaps most importantly we could use the Gaussian processes to generate trials where the noisy observations can be fitted with both models equally well. Inspecting the (closed-form) analytical expression for the marginal log likelihood of the Gaussian process reveals that it is composed of a data-fit-error term and a data-independent complexity term

$\log P(y|X,M_\lambda) = -\underbrace{\frac{1}{2} y^{\text{T}} K_\lambda^{-1} y}_{\text{data-fit-error}} - \underbrace{\left(\frac{1}{2}\log |K_\lambda| + \frac{N}{2}\log 2\pi\right)}_{\text{model complexity}},$

where $y$ denotes (noisy) observations at locations $X$, $M_{\lambda}$ indicates the models with different length-scales $\lambda$, $K_{\lambda}$ is the covariance matrix of the GP obtained by evaluating the kernel function for all input-pairs $k_{\lambda}(x_i, x_j)$ (which depends on the length-scale $\lambda$) and $N$ is the number of observations.
Importantly, the model complexity term is independent of the observations $y$ (in our case the number of observations $N$ was fixed throughout the whole experiment) and is governed by the length-scale parameter of the different models. The complexity term consists of the log determinant of the covariance matrix and a term that depends on the number of observations. One explanation why this term corresponds to a complexity term is that, up to an additive constant, it is the entropy of the (posterior) GP (since the GP is essentially a multivariate Gaussian) - a large entropy would thus imply a large complexity and vice versa. See the publication for more discussion on the complexity term and also some simulations that confirm that the model with the shorter length-scale also incurs a larger complexity term.
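The decomposition of the marginal log likelihood can be reproduced numerically with NumPy. The sketch below is not the code used for the paper: the input grid, the two length-scales (1.0 and 0.05), the unit signal variance and the observation-noise standard deviation of 0.1 added to the diagonal of $K_\lambda$ are all assumptions chosen for illustration.

```python
import numpy as np


def se_kernel(x, length_scale, noise_std=0.1):
    """Squared-exponential covariance matrix for inputs x, with
    (assumed) observation noise added on the diagonal."""
    diff = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (diff / length_scale) ** 2)
    return K + noise_std ** 2 * np.eye(len(x))


def log_marginal_terms(y, K):
    """Return (data_fit_error, complexity) such that
    log P(y | X, M) = -data_fit_error - complexity."""
    n = len(y)
    data_fit_error = 0.5 * y @ np.linalg.solve(K, y)
    _, logdet = np.linalg.slogdet(K)  # stable log-determinant
    complexity = 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi)
    return data_fit_error, complexity


x = np.linspace(0.0, 1.0, 20)      # hypothetical observation locations
y = np.sin(2.0 * np.pi * x)        # hypothetical observations
K_long = se_kernel(x, 1.0)         # simple model: smooth curves
K_short = se_kernel(x, 0.05)       # complex model: wiggly curves

for name, K in (("long", K_long), ("short", K_short)):
    fit, comp = log_marginal_terms(y, K)
    print(f"{name}: data-fit-error = {fit:.2f}, complexity = {comp:.2f}, "
          f"log marginal likelihood = {-fit - comp:.2f}")
```

The complexity term is computed from $K_\lambda$ alone, so it takes the same value for any $y$, and under these assumed settings it comes out larger for the short length-scale model.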

Having an analytical expression for the data-fit-error term that is separate from the model complexity term allowed us to use the GP framework to create stimuli where the noisy observations lead to the same data-fit-error term under both models. We used these trials to test whether humans would prefer the simpler model as an explanation in case both models fit the data equally well. Additionally, we could also generate trials where the data-fit-error term was different under both models and use the marginal likelihood ratio to predict theoretical choice probabilities according to Bayesian model selection (paired with a softmax selection rule).
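To make the idea of an equal-error trial concrete, here is one possible construction. It is explicitly not the procedure used in the paper (see the publication for that), and the resulting vector will not look like a natural stimulus. Equal data-fit-error means $y^{\text{T}}(K_{\text{long}}^{-1} - K_{\text{short}}^{-1})y = 0$, and since that matrix has both positive and negative eigenvalues, a suitable $y$ can be built from two of its eigenvectors. All names and parameter values (grid, length-scales, noise level, `beta`) are assumptions for this sketch.

```python
import numpy as np


def se_kernel(x, length_scale, noise_std=0.1):
    """Squared-exponential covariance matrix with assumed diagonal noise."""
    diff = x[:, None] - x[None, :]
    return (np.exp(-0.5 * (diff / length_scale) ** 2)
            + noise_std ** 2 * np.eye(len(x)))


def fit_error(y, K):
    return 0.5 * y @ np.linalg.solve(K, y)


def complexity(K):
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * logdet + 0.5 * K.shape[0] * np.log(2.0 * np.pi)


def equal_error_observations(K_a, K_b):
    """Return a unit-norm y with y^T K_a^{-1} y == y^T K_b^{-1} y."""
    A = np.linalg.inv(K_a) - np.linalg.inv(K_b)
    A = 0.5 * (A + A.T)                  # enforce exact symmetry
    mu, V = np.linalg.eigh(A)            # eigenvalues in ascending order
    # mu[0] < 0 < mu[-1]; weight the two eigenvectors so that the
    # positive and negative contributions to y^T A y cancel.
    y = np.sqrt(-mu[0]) * V[:, -1] + np.sqrt(mu[-1]) * V[:, 0]
    return y / np.linalg.norm(y)


x = np.linspace(0.0, 1.0, 20)
K_long, K_short = se_kernel(x, 1.0), se_kernel(x, 0.05)
y = equal_error_observations(K_long, K_short)

# With equal fit errors, only the complexity terms differ.
log_bf = ((-fit_error(y, K_long) - complexity(K_long))
          - (-fit_error(y, K_short) - complexity(K_short)))
beta = 1.0                               # hypothetical softmax temperature
p_simple = 1.0 / (1.0 + np.exp(-beta * log_bf))
print(fit_error(y, K_long), fit_error(y, K_short), log_bf, p_simple)
```

Because the data-fit-error terms cancel, the log Bayes factor equals the difference of the two complexity terms, so Bayesian model selection favors the long length-scale (simple) model on such a trial.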

Results

In equal-error trials where the data-fit-error term was the same under both models, we found that humans followed Occam’s razor and strongly preferred the simpler model. In general trials with arbitrary data-fit-error terms we found participants’ behavior to be quantitatively consistent with the theoretical model of Bayesian model selection combined with a softmax model selection mechanism.

Additionally we performed a control experiment to rule out that the preference for the simpler model was simply due to a preference for lower physical effort (as smooth trajectories require less effort to draw). Participants indicated their model choice with a mouse-click, and a trajectory was then drawn automatically according to their model choice. We found that the physical effort of drawing had no impact on model choice in our task.
We also performed a control experiment to rule out that the preference was not due to model complexity but simply due to smoothness of trajectories. We designed the two GP models in a way such that the wiggly model became the simple model. To achieve this we used a wiggly mean-function in the short length-scale model but then tuned the parameters of the GP such that the trajectories generated by the model are very similar to the mean and have very little variation. This led to a wiggly generative model that produced functions of high spatial frequency but with very little variation (simple model, because it only explains a very narrow range of observations) and a smooth generative model that produced a large variety of functions (complex model, even though the generated functions have low spatial frequency). See the publication for more details.

Tim Genewein