Tim Genewein

Paper presented @ IROS 2017: An Information-Theoretic On-Line Update Principle for Perception-Action Coupling

2017-09-25T14:30:00+00:00

Peng Z, Genewein T, Leibfried F, Braun DA (2017). An Information-Theoretic On-Line Update Principle for Perception-Action Coupling. IROS 2017. doi: 10.1109/IROS.2017.8202240

Our paper: An Information-Theoretic On-Line Update Principle for Perception-Action Coupling was presented at this year’s IROS in Vancouver. The paper builds upon the information-theoretic principle for perception-action coupling where a decision-maker is conceptualized as two serial information-processing channels: the first channel extracts relevant information about the “world-state” from the sensory stream and forms a “percept”. Subsequently an action-stage picks an action based on the internal percept in order to maximize utility or reward in a bounded rational fashion.

The theoretical groundwork for this paper was laid in our 2015 Frontiers paper, but back then applications were limited to discrete toy-problems with very low-dimensional state- and action-spaces. In this paper we derive a gradient-based on-line update rule to optimize the information-theoretic optimality principle for perception-action coupling. This allows using neural networks as powerful parametric models, which is the first step towards large-scale, continuous and high-dimensional applications of our principle. In the paper, we show that the gradient-based updates converge to equivalent solutions on a toy example used in our previous paper. Finally, we use the principle to simultaneously learn (bounded) optimal perceptual representations and action-policies for a simple visuomotor grasping task with a simulated NAO robot. Importantly, the perceptual stage of the robot is implemented with a neural network.

Talk @ Amlab Amsterdam: Information-optimal coupling of perception and action through lossy compression

2017-06-01T15:00:00+00:00

Genewein T (2017) Information-optimal coupling of perception and action through lossy compression

As part of the collaboration between Bosch Center for AI and Max Welling’s Amlab, or more precisely UvA-Bosch Delta Lab headed by Zeynep Akata, I visited UvA for a couple of days and also gave a talk (see abstract below). I was very impressed with the lab and had quite a few very interesting discussions that need to be followed up on. I find it remarkable how world-class labs have this very particular feel to them - it’s very hard to nail down which factors precisely are different compared to many other labs. Yet, the atmosphere instantly reminded me a lot of the MPI Tuebingen. I guess it’s mostly about the people and the discussion culture - everybody is really clever and has very convincing arguments why their approach/idea is really good, and at the same time everybody is very open to hearing about and critically reflecting on others’ ideas. It was a great visit - the campus is really nice as well and Amsterdam is certainly worth spending some time in.

Here’s the abstract of my talk (find out more about the topics on the Research pages or alternatively this paper):

Perception is processing of sensory information into useful representations (percepts or features) for acting. It is common practice in many application-domains, such as robotics or autonomous driving, to treat perception (e.g. computer vision) and action (e.g. planning or control) as more or less separate problems. In such cases, the perceptual representation is typically manually designed, e.g. a semantic segmentation of the input image. While such a representation might be sufficient for certain tasks, it is rarely optimal, which can lead to poor robustness and even render the learning problem for the action stage of the system more difficult than necessary (since the action stage needs to learn to ignore irrelevant information captured by the perceptual representation). This talk presents an optimality principle for coupling a perceptual stage that feeds into an action stage. The resulting objective function can be derived from well-known optimality principles (Bayesian inference, free-energy minimization and rate-distortion theory). As a result, the perceptual stage is formalized as a lossy compressor. The analytical solution to the optimality principle exhibits interesting theoretical properties - in particular the corresponding perceptual representations can be shown to (1) capture a lot of (task-) relevant information with (2) little redundancy, while (3) being robust to (task-) irrelevant variation. Another result is that the implicit objective of the perceptual stage is to enable the downstream action stage to operate most efficiently (i.e. with maximal free energy).

PhD thesis successfully defended - PhD completed

2017-03-29T14:30:00+00:00

Download thesis

Today I presented and defended my PhD thesis, which concludes my doctoral studies. I am very happy and grateful for everything I have learned through my PhD, and particularly for the people I have met along the way - many of whom have inspired me and helped me broaden my horizon. The time at the MPI Tuebingen will be very hard to beat - it is an amazing place. A big thanks to Daniel Braun and Pedro Ortega for putting together our group and also to my colleagues Zhen, Felix and Jordi. I will miss you guys…

You can download my dissertation here. Feel free to use the material for academic research or presentations as long as you respect the copyright and cite accordingly. Please be aware that the papers that are part of this (cumulative) thesis are subject to separate copyright statements (issued by the corresponding journals).

Paper accepted: On Detecting Adversarial Perturbations

2017-02-14T15:00:00+00:00

Metzen JH, Genewein T, Fischer V, Bischoff B (2017). On Detecting Adversarial Perturbations. International Conference on Learning Representations (ICLR) 2017.

Our paper on detecting adversarial perturbations has been accepted for publication at this year’s ICLR. In the paper we explore the possibility of training a reliable detector that discriminates whether an image is perturbed by adversarial noise or not. Since adversarial perturbations are characterized by being quasi-imperceptible to humans, it is not clear a priori whether a detector can be found that reliably distinguishes between genuine inputs and adversarial examples. We found that a deep neural network can be successfully trained to detect adversarial perturbations in a classification task (CIFAR-10, ImageNet).
In the paper, we explored different “depths of attachment” of the detector-network to the classifier-network. Additionally, we looked into how well detector networks generalize across different adversarial attack methods. Finally, we tested whether an attacker with access to both the classifier and the detector could construct adversarial perturbations that fool both systems. We found this to be the case. We show that training with adversarial examples that are generated dynamically during training makes it possible to harden the detector against these attacks. While we cannot create a classifier/detector pair that successfully detects $100\%$ of adversarial examples, we can show that it is possible to train good adversarial detectors that generalize to some degree across attack methods. These results indicate that adversarial attacks might also be characterized by certain regularities or structure which clearly separates adversarial examples from genuine data. Understanding these regularities might provide some insights into the nature of adversarial examples.

Paper presented @ ECML 2016: Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes

2016-09-22T11:00:00+00:00

Grau-Moya J, Leibfried F, Genewein T, Braun D.A. (2016) Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes. ECML PKDD 2016, Riva del Garda, Italy, Proceedings, Part II

Jordi Grau-Moya presented our paper on Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes at ECML 2016. In the paper, the free energy principle for decision-making is used to model different traits of (human) decision-making in a Markov decision-process (MDP):

bounded rationality, that is, decision-making under limited computational resources. In the MDP setting this corresponds to agents that cannot produce deterministic policies but rather have some degree of stochasticity (noise) in their action-selection/-execution.
model uncertainty - the agent is aware that its (state-transition-) model of the environment does not precisely reflect the true world and thus allows optimistic or pessimistic deviations from that model to drive exploration.

The application of the free-energy principle to planning in the MDP setting leads to a value-iteration-like scheme that makes it possible to model both bounded rationality and model uncertainty. Interestingly, the degree of bounded rationality and the degree of model uncertainty are each governed by a continuous parameter. For model uncertainty, this parameter can produce exploration behavior on a continuum between optimism, indifference and pessimism in the face of uncertainty, which could be very interesting for a general robot-learning framework where the model-uncertainty attitude is governed by the task-setting (an autonomous car should probably explore in a very conservative fashion, whereas a simulated robot can potentially explore with great optimism to speed up learning).
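To give a feeling for what such a scheme looks like, here is a minimal sketch of the bounded-rationality half only. It is not the algorithm from the paper: it assumes the standard free-energy (log-sum-exp) backup $V(s) = \frac{1}{\beta} \log \sum_a \rho(a) e^{\beta Q(s,a)}$ with a uniform prior policy $\rho$, it leaves out the model-uncertainty part entirely, and it runs on a made-up two-state MDP. All names (`free_energy_value_iteration`, `q_values`, `T`, `R`, `beta`, `gamma`) are hypothetical.

```python
import math

def q_values(T, R, V, gamma):
    """Q(s,a) = R(s,a) + gamma * sum_s' T(s'|s,a) V(s')."""
    n_states, n_actions = len(R), len(R[0])
    return [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)] for s in range(n_states)]

def free_energy_value_iteration(T, R, beta, gamma=0.9, iters=500):
    """Soft value iteration with a uniform prior policy (assumption).

    beta > 0 is the inverse temperature: large beta approaches the usual
    max-backup, small beta approaches averaging over the prior policy.
    """
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    for _ in range(iters):
        Q = q_values(T, R, V, gamma)
        V = []
        for q_row in Q:
            m = max(q_row)  # subtract the max for numerical stability
            z = sum(math.exp(beta * (q - m)) for q in q_row) / n_actions
            V.append(m + math.log(z) / beta)
    # Bounded-rational policy: prior (uniform) reweighted by exp(beta * Q).
    policy = []
    for q_row in q_values(T, R, V, gamma):
        m = max(q_row)
        w = [math.exp(beta * (q - m)) for q in q_row]
        total = sum(w)
        policy.append([x / total for x in w])
    return V, policy

# Toy MDP (made up): action a moves deterministically to state a,
# and action 1 always pays reward 1 while action 0 pays nothing.
T = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 1.0], [0.0, 1.0]]

V_hi, policy_hi = free_energy_value_iteration(T, R, beta=50.0)
V_lo, policy_lo = free_energy_value_iteration(T, R, beta=0.01)
print("beta=50  :", V_hi, policy_hi)
print("beta=0.01:", V_lo, policy_lo)
```

With a large `beta` the agent almost always picks the rewarding action, whereas with a small `beta` the policy stays close to the uniform prior and the value drops accordingly.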

Moving on: Cognitive Systems research group of Bosch

2016-07-15T09:00:00+00:00

I am happy to announce that I have joined the Cognitive Systems research group of Bosch. The group is part of the corporate research sector of Bosch and is located at the brand-new research campus of Bosch in Renningen (close to Stuttgart).

The group features a great mix of people with diverse backgrounds, working on various aspects of making machines smarter. I have joined the group as a research scientist in the deep learning team and I am looking forward to getting my hands dirty with state-of-the-art deep learning methods. For a while, I have wanted to get some hands-on deep learning skills as I think that deep learning is a tool that must currently not be missing from any machine learner’s tool-belt. But rather than just learning how to use deep learning as a mere tool, the goal is to rapidly advance to a stage where I can also contribute towards academic research since I believe that there is a lot of overlap with the research topics of my PhD and some very timely questions in deep learning research. One of the main reasons for me choosing to join the Cognitive Systems group is that the group not only provides a fruitful and stimulating research environment but also encourages working on the advancement of current research frontiers.

Talk @ GRASP lab, UPenn and UAI 2016

2016-06-20T14:00:00+00:00

Genewein T, Grau-Moya J (2016) GRASP Special Seminar: Decision-Making with Information Constraints: Free Energy Foundations and Applications, UPenn Philadelphia.

Jordi Grau-Moya and I gave a special seminar at the GRASP lab (University of Pennsylvania) with the topic: Decision-Making with Information Constraints: Free Energy Foundations and Applications. The seminar was split into two parts. The first part was given by Jordi, who talked about the theoretical foundations of the free energy principle for decision-making and then showed how the principle can be used to model different traits of (human) decision-making: bounded rationality, that is, decision-making under limited computational resources; model uncertainty (the agent is aware that its model does not precisely reflect the true world and thus allows optimistic or pessimistic deviations from that model to drive exploration); and risk sensitivity (in the light of known probabilities, an agent could still prefer risk-seeking or risk-averse behavior compared to simply taking the average). His talk concluded with an application of the principle to an MDP setting, which leads to a value-iteration-like scheme for planning in MDPs that makes it possible to model both bounded rationality and model uncertainty (the agent has a model over the unknown transition-probabilities and can deviate towards an optimistic or pessimistic attitude with respect to the unknown transition-probabilities). See the preprint of the corresponding publication here - this work will be presented in a talk at ECML 2016.

In the second part of the talk, I explained the connection between rate-distortion theory and the free energy principle for decision-making and showed how rate-distortion for decision-making can lead to the emergence of natural levels of (behavioral) abstractions. The principle was extended to three-variable systems, which makes it possible to model bounded-optimal perception-action systems and hierarchies of abstraction. See the Research pages for more information or alternatively this paper.

After a short stay with Pedro Ortega, we visited UAI 2016 in Jersey City, NJ. UAI was definitely interesting, even though it is a rather small conference. The small size makes it easy to get in touch with people and to find time for solid discussions (unlike other conferences where it is almost impossible to approach somebody at their poster). There was quite a bit of interesting and highly related work. Perhaps the most interesting novelty for me was that people in (differential) privacy deal with the problem of “finding the right amount of noise” in order to destroy some information (that would compromise privacy) while keeping the relevant information for a certain task. This should actually map straightforwardly onto rate-distortion theory, where the goal is also to discard irrelevant information and keep only relevant information. In rate-distortion, information is transmitted over a channel of limited capacity - one way to limit a channel’s capacity is by injecting noise. I should definitely look into this connection in the future and see if this has already been fully fleshed out in current work on privacy.

Talk @ ICRA workshop: Information-theoretic bounded rationality in perception-action systems

2016-05-16T15:00:00+00:00

Genewein T (2016) Information-theoretic bounded rationality in perception-action systems. ICRA 2016 workshop on task-driven perceptual representations: sensing, planning and control under resource constraints, Stockholm.

I was given the opportunity to present my work on information-theoretic bounded rationality in perception-action systems with a talk at the workshop on task-driven perceptual representations: sensing, planning and control under resource constraints at ICRA 2016 in Stockholm. In the talk I presented the information-theoretic framework for bounded rationality that trades off large expected utility against low information processing cost (see the Research pages for more information or alternatively this paper).

The central idea of the optimality principle is that any change in behavior incurs computation and in an agent with limited computational capacity this computation is costly. The cost of computation should thus be traded off against achieving a high expected utility. Interestingly, the resulting principle is identical to the rate-distortion problem (the information-theoretic framework for lossy compression) and has strong formal ties to free energy minimization in thermodynamics. The problem in lossy compression is essentially the same as when forming abstractions: relevant information must be separated from irrelevant information (noise). The rate-distortion principle for bounded rational decision-making has a parameter that governs the trade-off between utility and information processing cost and one result (see here) is that changing this parameter leads to the emergence of natural levels of abstraction.
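For discrete problems, the rate-distortion trade-off can be solved with the classic Blahut-Arimoto iteration, which alternates between the soft-max policy $p(a|w) \propto p(a) e^{\beta U(w,a)}$ and its marginal $p(a) = \sum_w p(w) p(a|w)$. The following is a minimal sketch under my own assumptions: the utility matrix is a made-up toy example, and the function and variable names are hypothetical.

```python
import math

def blahut_arimoto(p_w, U, beta, iters=200):
    """Alternate between the soft-max policy and its action marginal."""
    n_w, n_a = len(U), len(U[0])
    p_a = [1.0 / n_a] * n_a  # start from a uniform marginal
    for _ in range(iters):
        p_a_given_w = []
        for w in range(n_w):
            m = max(U[w])  # subtract the max for numerical stability
            weights = [p_a[a] * math.exp(beta * (U[w][a] - m)) for a in range(n_a)]
            z = sum(weights)
            p_a_given_w.append([x / z for x in weights])
        p_a = [sum(p_w[w] * p_a_given_w[w][a] for w in range(n_w))
               for a in range(n_a)]
    return p_a_given_w, p_a

def mutual_information(p_w, p_a_given_w, p_a):
    """I(W;A) in nats for the joint p(w) p(a|w)."""
    return sum(p_w[w] * p * math.log(p / p_a[a])
               for w, row in enumerate(p_a_given_w)
               for a, p in enumerate(row) if p > 0)

# Toy problem (made up): three world-states, each with one matching action.
p_w = [1 / 3, 1 / 3, 1 / 3]
U = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

policy_lo, marg_lo = blahut_arimoto(p_w, U, beta=0.1)
policy_hi, marg_hi = blahut_arimoto(p_w, U, beta=20.0)
mi_lo = mutual_information(p_w, policy_lo, marg_lo)
mi_hi = mutual_information(p_w, policy_hi, marg_hi)
print("beta=0.1 : I(W;A) =", mi_lo)
print("beta=20  : I(W;A) =", mi_hi)
```

Small values of `beta` make information processing expensive, so the policy barely depends on the world-state. Large values let the policy approach the deterministic maximum-utility solution, with the mutual information approaching $\log 3$ nats in this toy case.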

In the second part of the talk, the information-theoretic principle for bounded rationality is applied to a two-stage perception-action system. In short, the system has two computational stages that are subject to limited computational resources: a perceptual stage that takes a world-state and transforms it into an (internal) percept and an action-stage that computes optimal actions in response to the world-state, based only on the percept. Classically, perception is often treated as an inference problem and correspondingly the goal of perception is to represent the (latent) world-state as faithfully as possible, meaning that the world-state can be predicted well from the percept. Importantly, the inference-problem is decoupled from the action-part of the system.
The consistent application of the information-theoretic principle requires trading off gains in utility against the cost of the computation incurred on both stages (perception and action). Solutions to the resulting optimality principle for perception and action under limited computational resources lead to a tight coupling between perception and action, suggesting that the goal of bounded-optimal perception is to extract the most relevant information for acting, which does not necessarily imply that the true world-state can be predicted well from the percept. For instance, consider a case where the world-state carries a lot of information that is completely irrelevant for acting: if the perceptual stage has limited computational resources, then its goal should be to extract the relevant information for acting rather than capturing a lot of the irrelevant information. This naturally couples perception and action. See more on this in our publication, including an intuitive, illustrative example.

Paper by F. Leibfried: Bounded rational decision-making in feedforward neural networks

2016-05-06T11:00:00+00:00

Felix Leibfried just published a very interesting application of the information-theoretic principle for bounded rationality to feedforward neural networks (including convolutional neural networks). This work will be presented in a plenary talk at UAI 2016.

In a nutshell, the information-theoretic optimality principle (similar to the rate-distortion principle, see here) was used to derive a gradient-based on-line update rule to learn the parameters of a neural network. In the principle, gains in utility must be traded off against the computational cost that these gains incur:

$\underset{p_\theta(a|w)}{\text{arg max}} \underbrace{\sum_{w,a} p_\theta(w,a) U(w,a)}_{\text{expected utility}} - \frac{1}{\beta} \underbrace{I(W;A)}_{\text{computational demand}},$

where a parametric model $p_\theta(a\vert w)$ is used to describe the stochastic mapping from an input $w$ to an output $a$. In the paper, Felix shows how to derive a gradient-ascent rule for finding a (locally) optimal $\theta$. Then, the paper shows how to apply the same principle and derive an on-line gradient-based update rule when using a feedforward neural network for the parametric model. Interestingly, the result is an update-rule very similar to the (well-known) error-backpropagation with an additional regularization-term that results from the mutual information constraint (that formalizes limited computational capacity).
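To make the idea of gradient ascent on this objective concrete, here is a small sketch that is deliberately much simpler than the setting in the paper: instead of a neural network it uses a tabular soft-max model with one logit per $(w,a)$ pair, and it uses full-batch rather than on-line updates. For this parametrization the gradient with respect to the logit $\theta_{w,a}$ works out to $p(w)\, p_\theta(a\vert w)\, [g(w,a) - \sum_{a'} p_\theta(a'\vert w)\, g(w,a')]$ with $g(w,a) = U(w,a) - \frac{1}{\beta} \log \frac{p_\theta(a\vert w)}{p_\theta(a)}$. The toy utility and all names are my own assumptions.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]

def objective_and_grad(theta, p_w, U, beta):
    """Return J = E[U] - I(W;A)/beta and its gradient w.r.t. the logits."""
    n_w, n_a = len(U), len(U[0])
    policy = [softmax(row) for row in theta]
    p_a = [sum(p_w[w] * policy[w][a] for w in range(n_w)) for a in range(n_a)]
    # g(w,a): utility minus the (scaled) pointwise information cost.
    g = [[U[w][a] - math.log(policy[w][a] / p_a[a]) / beta for a in range(n_a)]
         for w in range(n_w)]
    J = sum(p_w[w] * policy[w][a] * g[w][a]
            for w in range(n_w) for a in range(n_a))
    grad = []
    for w in range(n_w):
        g_bar = sum(policy[w][a] * g[w][a] for a in range(n_a))
        grad.append([p_w[w] * policy[w][a] * (g[w][a] - g_bar)
                     for a in range(n_a)])
    return J, grad, policy

def gradient_ascent(p_w, U, beta, lr=1.0, steps=500):
    theta = [[0.0] * len(U[0]) for _ in U]  # uniform initial policy
    J_start, _, _ = objective_and_grad(theta, p_w, U, beta)
    for _ in range(steps):
        _, grad, _ = objective_and_grad(theta, p_w, U, beta)
        theta = [[t + lr * d for t, d in zip(t_row, d_row)]
                 for t_row, d_row in zip(theta, grad)]
    J_end, _, policy = objective_and_grad(theta, p_w, U, beta)
    return J_start, J_end, policy

# Toy problem (made up): each of three inputs has one matching output.
p_w = [1 / 3, 1 / 3, 1 / 3]
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
J_start, J_end, learned_policy = gradient_ascent(p_w, U, beta=5.0)
print("objective before/after:", J_start, J_end)
print("learned p(a|w=0):", learned_policy[0])
```

The term with the log-ratio plays the role of the regularizer: it penalizes conditional action probabilities that deviate strongly from the marginal.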

This result adds an interesting angle to the notion of limited computational resources: an alternative way to interpret computational limitations is to view them as a regularizer, that (sort of) artificially imposes computational limitations in order to be robust. This regularizer reflects the computational limitation that results from having small sample sizes - with an infinitely large sample size, corresponding to no computational limitation, the regularizer is not needed. However, under limited sample size (limited information to update the parameters), a regularizer “emulates” limited computational resources that reflect the lack of rich parameter-update information.

The paper concludes by showing simulations that apply the learning rule to a regular multi-layer perceptron as well as a convolutional neural network on the MNIST data set.

Paper published: Bio-inspired feedback-circuit implementation of discrete, free energy optimizing, winner-take-all computations

2016-03-29T15:00:00+00:00

Genewein T, Braun DA (2016). Bio-inspired feedback-circuit implementation of discrete, free energy optimizing, winner-take-all computations. Biological Cybernetics. doi: 10.1007/s00422-016-0684-8

Our paper on bio-inspired feedback circuits for free-energy optimization was published. In the paper we explore the idea that a free energy minimization (which is the basis of an information theoretic framework for bounded rational inference and decision-making, see the corresponding Research article) can be described by a dynamical system.

Bounded rational decision-making (including Bayesian inference as a special case) requires the accumulation of utility (or evidence), to transform a prior strategy (or belief) into a posterior probability distribution over actions (or hypotheses). Crucially, this process cannot be simply realized by independent integrators, since the different hypotheses and actions also compete with each other. In continuous time, this competitive integration process can be described by a special case of the replicator equation (known from evolutionary biology for describing evolutionary competition between different populations). Here we investigate simple analog electric circuits that implement the underlying differential equation under the constraint that we only permit a limited set of building blocks that we regard as biologically interpretable, such as capacitors, resistors, voltage-dependent conductances and voltage- or current-controlled current and voltage sources. The appeal of these circuits is that they intrinsically perform normalization without requiring an explicit divisive normalization.
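The competitive integration can be illustrated numerically. The sketch below assumes the simplest form of the replicator equation with constant utilities, $\dot{p}_i = p_i (U_i - \sum_j p_j U_j)$, whose solution is the prior reweighted by exponentiated accumulated utility, $p_i(t) \propto p_i(0) e^{U_i t}$. It uses plain Euler integration rather than a circuit simulation, and the numbers and function names are made up for illustration.

```python
import math

def replicator_euler(p0, U, t_end, dt=1e-3):
    """Euler integration of dp_i/dt = p_i * (U_i - sum_j p_j U_j)."""
    p = list(p0)
    for _ in range(int(round(t_end / dt))):
        avg = sum(pi * ui for pi, ui in zip(p, U))
        p = [pi + dt * pi * (ui - avg) for pi, ui in zip(p, U)]
    return p

def closed_form(p0, U, t):
    """Prior times exponentiated accumulated utility, then normalized."""
    w = [pi * math.exp(ui * t) for pi, ui in zip(p0, U)]
    z = sum(w)
    return [x / z for x in w]

prior = [0.2, 0.3, 0.5]       # prior strategy (or belief)
utilities = [1.0, 0.5, 0.0]   # constant utility (or evidence) per option

p_sim = replicator_euler(prior, utilities, t_end=2.0)
p_exact = closed_form(prior, utilities, t=2.0)
print("simulated :", p_sim)
print("closed form:", p_exact)
print("sum of simulated probabilities:", sum(p_sim))
```

Note that the update never divides by a normalization constant, yet the probabilities keep summing to one (up to floating-point error), because the competition term subtracts the average utility.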

However, even in idealized simulations, we find that these circuits are very sensitive to internal noise as they accumulate error over time. In the paper, we discuss to what extent neural circuits could implement these operations that might provide a generic competitive principle underlying both perception and action. In short: the problem is that a naive implementation of the replicator equation requires synaptic weights to change on the same time-scale with which the (perceptual) input to the system varies. This cannot be mapped to any of the standard models of neural competitive integration (e.g. pooled-inhibition models). This observation naturally leads to the question of whether signatures of the replicator-dynamics might be evident in other, perhaps more intricate, neural circuitry. While this question is beyond the scope of the paper, the novel point of view on competitive utility (or evidence) integration with dynamical systems as presented in the paper could be very interesting for researchers working on the neurophysiological basis of decision-making.

Talk @ Bosch research Renningen: hierarchical decision-making in perception-action systems

2016-02-29T11:00:00+00:00

I was invited to give a talk at the Cognitive Systems and Machine Learning group of Bosch Corporate Research located at the recently opened Bosch research campus in Renningen. My talk was about the theoretical aspects of my work - starting off with an introduction to information-theoretic bounded rationality and then moving on to how perception-action systems can be modeled and how bounded-optimal decision-making hierarchies can emerge as a consequence of acting optimally under limited computational resources. More information on all topics can be found on the Research pages.

After giving my talk I spent the rest of the day talking to researchers and students of the various groups, which I really enjoyed. I was impressed by the diversity of the research projects and was positively surprised by the great atmosphere.

Why we need to stop using rainbow colormaps: an interactive example

2016-02-15T11:00:00+00:00

This article comes with an accompanying Jupyter notebook; launch it from your browser with this button (if Binder plays along, otherwise use the GitHub repository):

In the early nineties some software company decided to make a rainbow colormap the default-colormap. Soon others would follow and today data-visualizations are flooded with rainbow colormaps. You can find them in many scientific publications in all kinds of fields of research, but they also appear as default options in commercial software across all kinds of domains.

People in the late nineties already pointed out that rainbow colormaps are a particularly bad way of visualizing data - despite the shiny and fancy looks. There’s a whole bunch of problems associated with rainbow colormaps and the Internet is full of articles, blog-posts and scientific papers, pointing out the flaws and providing better alternatives. For instance, one study found that (future) medical doctors preferred rainbow colormaps for detecting anomalies in some processed imaging data (for detecting coronary artery disease), mostly because they were so used to this particular colormap which has become the de-facto standard in the medical literature. However, the same study found that participants’ detection accuracy was increased with a different colormap, even though participants were not particularly used to that colormap (see accompanying Jupyter notebook for a reference to the study).

Meanwhile there are open letters asking scientific communities to abandon the rainbow and journals to start banning its use as it is considered poor visual communication, particularly since better methods are readily available. And indeed, almost all software packages for visualization and scientific computing provide a whole range of colormaps, a lot of which are much more suitable than rainbow colormaps. Despite that, the scientific literature is still flooded with images in rainbow colors.

See the accompanying Jupyter notebook for an interactive example that clearly illustrates the problems with rainbow colormaps (and perceptually non-uniform colormaps). You can also find more information and further reading in the notebook.
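One of the problems can be checked without any plotting at all. The sketch below is not taken from the notebook: it builds a hypothetical stand-in rainbow map by sweeping the hue from blue to red with the standard-library `colorsys` module and computes the relative luminance of each color (sRGB linearization followed by the usual 0.2126/0.7152/0.0722 weighting). Relative luminance is not the same as perceived lightness, but the two are monotonically related, so ups and downs in one show up in the other.

```python
import colorsys

def srgb_to_linear(c):
    """Undo the sRGB transfer function for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(r, g, b):
    r, g, b = (srgb_to_linear(c) for c in (r, g, b))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def rainbow(x):
    """Stand-in rainbow map (assumption): hue runs from blue (x=0) to red (x=1)."""
    return colorsys.hsv_to_rgb((1.0 - x) * 2.0 / 3.0, 1.0, 1.0)

def gray(x):
    return (x, x, x)

def is_monotonic(values):
    return all(b >= a for a, b in zip(values, values[1:]))

xs = [i / 20 for i in range(21)]
rainbow_lum = [relative_luminance(*rainbow(x)) for x in xs]
gray_lum = [relative_luminance(*gray(x)) for x in xs]

print("rainbow luminance monotonic:", is_monotonic(rainbow_lum))
print("gray luminance monotonic:   ", is_monotonic(gray_lum))
```

In the rainbow map the brightness rises and falls several times along the data axis, so equal steps in the data do not look like equal steps in the image, and bright bands such as cyan and yellow suggest features that are not in the data.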

The notebook can (potentially) be run directly in the browser using Binder, even though its use is still experimental with occasional hiccups (Feb. 2016, big kudos to Binder for providing this much needed platform).

Launch the notebook in your browser with the button below:

Alternatively find the notebook for viewing or downloading in my GitHub repository.

Paper published: Bounded rationality, abstraction and hierarchical decision-making: an information-theoretic optimality principle

2015-11-11T15:00:00+00:00

Genewein T, Leibfried F, Grau-Moya J, Braun DA (2015). Bounded rationality, abstraction and hierarchical decision-making: an information-theoretic optimality principle. Front. Robot. AI 2:27. doi: 10.3389/frobt.2015.00027

Our paper on an information-theoretic optimality principle that leads to the emergence of abstractions and also hierarchies of abstractions was published in Frontiers in Robotics and AI as part of the research topic on Theory and Applications of Guided Self-Organisation in Real and Synthetic Dynamical Systems.

At the core of the paper is the consistent trade-off between large expected utility and low information processing cost (as measured by the mutual information). The optimization of this trade-off leads to a quantitative framework for bounded rational decision-making, that is, optimal decision-making under information processing limitations. In the paper we present an extension of the basic trade-off to more complex architectures that involve more computational nodes. We find that this extension leads to non-trivial solutions that allow the design of bounded-optimal decision-making hierarchies. In particular we present two architectures in the paper:

Serial hierarchies: the serial hierarchy consists of a perceptual channel and an action channel, both of which are limited in their information processing rate. The perceptual channel maps a world state to an observation or internal percept. Since the rate on the channel is limited, the mapping must be lossy, implying that a one-to-one (bijective) mapping between world states and percepts is not possible. Classically the perceptual stage is designed independently of the action stage and the goal of perception is to represent the true world-state as faithfully as possible (given the perceptual limitations). By applying our optimality principle we find that perception should be tightly coupled to action and that the goal of the perceptual part is to extract the most relevant information about the world-state such that the action-part of the system can work most efficiently (given its computational limitations). The optimality principle yields a well-defined likelihood function for bounded-optimal perception and also answers the question of how to design the action-part of the system in a bounded-optimal fashion.
Parallel hierarchies: In a parallel hierarchical architecture there are two levels of computation: the upper level partially processes information about the world-state and then narrows down the search space over possible actions. The lower level searches within the narrowed down search space for “good” actions in response to the world-state. For instance, imagine that the optimal response is a certain parameter-setting of the movement-controller. The upper level narrows down uncertainty over good parameter settings with a distribution over the parameters - this distribution over parameters is often called a model. The lower level of the hierarchy uses the distribution imposed by the model (that is the reduced search space) and computes a policy that maximizes expected utility given the computational limitations of the lower level. The important question is: who designs the models (the distributions over parameters imposed by the models) and given some observations, which model should be used? By applying our optimality principle to parallel hierarchies we find principled, bounded-optimal answers to both questions. Importantly, the emerging two-level architecture is shaped by the utility-function and the computational constraints of the agent. Additionally, since computational limitations are an integral component of the optimality principle, computational tractability of the solutions can be guaranteed. In the paper we also show how the two-level hierarchy can be used to split up information processing load between both levels of the hierarchy - in the model-parameter example this means that uncertainty over the optimal parameter(s) is reduced in two steps: first on the upper level by selecting the correct model and then on the lower level by processing more information about the world-state.

The paper also contains a unifying general case that includes the serial and parallel hierarchies as special cases and is a promising starting-point for generalizing towards more complex architectures.

See the Research pages for more information on the information-theoretic principle for bounded rational decision-making and its extension to decision-making hierarchies. The work presented in the paper is a follow-up on our earlier work on the formation of abstractions through information processing limitations and was partially presented in a talk at the seventh workshop on GSO.

Paper published: structure learning in Bayesian sensorimotor integration.

2015-08-26T14:31:51+00:00

Genewein T, Hez E, Razzaghpanah Z, Braun DA (2015) Structure Learning in Bayesian Sensorimotor Integration. PLoS Comput Biol 11(8): e1004369. doi: 10.1371/journal.pcbi.1004369

Our paper on structure learning in Bayesian sensorimotor integration got published in PLoS Computational Biology. In the experiment we showed that humans are able to extract higher-level statistical invariants in a sensorimotor task. Additionally we showed how a hierarchical Bayesian model could explain human behavior in the task (including extraction of statistical invariants).

What is structure learning?

Humans do extraordinarily well at extracting re-usable knowledge when learning a concrete motor task. For instance, after learning how to ride a mountain bike, humans can easily learn how to ride a different type of bike (e.g. a racing bike). The idea is that both tasks share some common structure, that is higher-level statistical invariants, and that humans can extract these invariants and use them to quickly adapt to novel but similar tasks (“learning to learn”). This poses the question of how to separate knowledge into a concrete, task-specific part and a more abstract, invariant part. One approach to tackle this problem is to use hierarchical Bayesian models, where the higher-level statistical invariants are captured in the upper levels of the hierarchical model (typically through distributions over the parameters of the prior, so-called hyper-priors, that are shared across tasks). Adapting to novel tasks but also inferring the parameters of the hyper-prior can then be performed with (hierarchical) Bayesian inference. For more on this see for instance Learning overhypotheses with hierarchical Bayesian models, Kemp, Perfors, Tenenbaum (2007) Developmental Science.

Experimental study

In our experiment we investigate structure learning, that is the extraction of higher-level statistical invariants, in a sensorimotor task. We used our 3D virtual reality setup to design a reaching task with a visuomotor shift. This means that participants had to reach from a start-point to a target using a virtual cursor. Crucially, the cursor did not represent their hand-position veridically but could be shifted horizontally and vertically (the reaching movement was orthogonal to the horizontal-vertical plane). This shift was drawn from a 2D Gaussian distribution in each trial. Participants would not see their virtual hand-position represented by the cursor during the movement but halfway through the movement they would receive brief visual feedback about the cursor position. This feedback can be combined with the experience from the previous trials (the learned prior) to initiate a corrective movement that hits the target. It has been shown previously that in such a task humans integrate prior knowledge with uncertain feedback information in a way that is quantitatively consistent with Bayesian inference (Bayesian integration in sensorimotor learning, Koerding, Wolpert (2004) Nature) - that is weighting the two sources of information according to their reliability.
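The reliability weighting mentioned above can be sketched for a single dimension. This is a minimal illustration, assuming a Gaussian prior over the shift and Gaussian feedback noise; the function name and all numbers are hypothetical and not taken from the experiment.

```python
def integrate_gaussian(prior_mean, prior_var, feedback, feedback_var):
    """Combine a Gaussian prior with one noisy Gaussian observation.

    Returns the posterior mean and variance of the shift.
    """
    # The weight on the feedback grows as the feedback becomes more
    # reliable (small feedback_var) relative to the prior.
    weight_feedback = prior_var / (prior_var + feedback_var)
    posterior_mean = prior_mean + weight_feedback * (feedback - prior_mean)
    posterior_var = prior_var * feedback_var / (prior_var + feedback_var)
    return posterior_mean, posterior_var


# Illustrative numbers only: prior and feedback are equally reliable,
# so the estimate lands halfway between them.
posterior_mean, posterior_var = integrate_gaussian(
    prior_mean=1.0, prior_var=0.25, feedback=2.0, feedback_var=0.25
)
print(posterior_mean, posterior_var)
```

With very precise feedback (`feedback_var` close to zero) the estimate follows the feedback, and with no feedback at all the estimate falls back to the prior mean.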
In our task there were four types of visual feedback:

Full feedback: precise feedback about the position of the (shifted) cursor - allows for precise corrections in order to hit the target without requiring knowledge about the prior-distribution over shifts.
Partial horizontal feedback: A “vertical bar” like stimulus that gives precise feedback about the horizontal dimension but no feedback about the vertical dimension.
Partial vertical feedback: A “horizontal bar” like stimulus that gives precise feedback about the vertical dimension but no feedback about the horizontal dimension.
No feedback: no visual feedback is provided and the reaching movement relies completely on the knowledge about previous shifts (the prior).

The two-dimensional prior distribution over the shifts allowed us to introduce non-trivial statistical structure between the two dimensions of the shift - structure that the participants could learn as higher-order invariants. In our experiment we tested two groups of participants, each with a different prior over the shifts (that stayed constant throughout the experiment):

Uncorrelated group: no correlation between the two dimensions of the shift. In partial feedback trials participants must rely on their knowledge about previous trials for movement correction in the uninformative feedback dimension (e.g. for the vertical bar feedback, participants did not get any information about the vertical dimension of the shift and had to use their learned prior for this dimension).
Correlated group: full correlation between the horizontal and vertical dimension of the shift. If participants had learned the correlation structure of the prior then receiving partial feedback about one dimension of the shift provides knowledge about both dimensions of the shift (since they are correlated), allowing participants to hit the target precisely.
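The difference between the two groups can be made concrete with the conditional distribution of a bivariate Gaussian. The sketch below assumes that one dimension of the shift is observed exactly (as in the partial feedback trials) and asks what the prior implies about the other dimension. The function name and the numerical values are hypothetical, chosen only for illustration.

```python
import math


def predict_unseen_dimension(mean_h, mean_v, std_h, std_v, rho, observed_h):
    """Conditional mean and std of the vertical shift given the horizontal shift.

    Assumes a bivariate Gaussian prior with correlation coefficient rho.
    """
    cond_mean = mean_v + rho * (std_v / std_h) * (observed_h - mean_h)
    cond_std = std_v * math.sqrt(1.0 - rho ** 2)
    return cond_mean, cond_std


# Uncorrelated prior: the horizontal feedback says nothing about the
# vertical shift, so the best guess is the prior mean.
uncorrelated = predict_unseen_dimension(0.0, 0.0, 1.0, 1.0, rho=0.0, observed_h=1.5)

# Fully correlated prior: the horizontal feedback determines the
# vertical shift and no uncertainty remains.
correlated = predict_unseen_dimension(0.0, 0.0, 1.0, 1.0, rho=1.0, observed_h=1.5)

print(uncorrelated, correlated)
```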

The prediction is thus that participants of the correlated and uncorrelated group should behave differently in partial feedback trials. In particular, if the correlated group is able to learn the structure of the prior, they should be able to hit the target with more precision in the partial feedback trials compared to the uncorrelated group.
We were able to confirm this prediction in the experiment. Additionally we found that learning of the mean of the prior over shifts was quite fast and robust across participants whereas learning of the correlation structure required a lot more trials and even after 4000 trials learning had not yet flattened out.

Modeling with a HBM

To model participants’ behavior in the task we used a hierarchical Bayesian model (HBM) where we assumed a Gaussian prior distribution over the shifts with unknown parameters (the mean and covariance matrix). We placed a distribution over the parameters of the prior - the so-called hyper-prior - which was a normal inverse Wishart distribution (the conjugate prior for Gaussians with unknown mean and covariance). This allowed us to derive an online update rule to infer the parameters of the hyper-prior. We trained the HBM on the same data that the participants were exposed to and found that by placing a strong prior-belief over uncorrelated covariance-matrices we could replicate the timescales of learning the correlation structure with the HBM. Similarly, the HBM could learn the correct mean of the prior and could be used to produce responses in each trial that are statistically similar to human behavior. Importantly, when the HBM was trained with uncorrelated shifts (keeping all other parameters the same), we could also replicate human behavior in the uncorrelated group.
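To give a feel for such an online update, here is a minimal sketch of the textbook conjugate update of a normal inverse Wishart distribution after one fully observed shift. It is not the model code from the paper: the hyper-parameter values, the simulated shifts and all names are assumptions made for illustration, and the model in the paper additionally has to deal with partial and missing feedback.

```python
import math
import random


def niw_update(mu, kappa, nu, psi, x):
    """One conjugate normal inverse Wishart update for a single observation x."""
    d = len(x)
    diff = [x[i] - mu[i] for i in range(d)]
    scale = kappa / (kappa + 1.0)
    new_psi = [[psi[i][j] + scale * diff[i] * diff[j] for j in range(d)]
               for i in range(d)]
    new_mu = [(kappa * mu[i] + x[i]) / (kappa + 1.0) for i in range(d)]
    return new_mu, kappa + 1.0, nu + 1.0, new_psi


def expected_correlation(psi):
    """Correlation implied by the expected covariance psi / (nu - d - 1).

    The scalar factor cancels, so only psi is needed.
    """
    return psi[0][1] / math.sqrt(psi[0][0] * psi[1][1])


# Hypothetical hyper-prior: a strong belief (large nu) in an uncorrelated
# covariance matrix, since psi / (nu - d - 1) is the identity for d = 2.
mu, kappa, nu = [0.0, 0.0], 1.0, 50.0
psi = [[47.0, 0.0], [0.0, 47.0]]

rng = random.Random(0)
correlation_by_trial = []
for trial in range(1000):
    s = rng.gauss(0.0, 1.0)  # simulated, fully correlated shift (s, s)
    mu, kappa, nu, psi = niw_update(mu, kappa, nu, psi, [s, s])
    correlation_by_trial.append(expected_correlation(psi))

print(correlation_by_trial[9], correlation_by_trial[999])
```

In this sketch the strength of the initial belief in uncorrelated shifts controls how many trials are needed before the estimated correlation approaches its true value.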

MLSS 2015 in Tuebingen

2015-07-24T14:31:51+00:00

From July 13th to 24th the machine learning summer school (MLSS) was held at our institute, the Max Planck Institute for Intelligent Systems in Tuebingen. I was involved as part of the team of local helpers that made sure things ran smoothly for all the participants.

It was quite interesting to be involved in (parts of) the organization and execution of such an event. The summer school was attended by more than 120 participants with a lot more applications (unfortunately the number of participants is limited due to space constraints of the lecture hall). As last time when the MLSS was held here in Tuebingen, the summer school featured an excellent selection of speakers and was complemented by loads of social activities that left room for participants to interact. I was impressed to see so many really interesting research projects presented by participants in the poster sessions and also how often similar questions appear in completely different contexts.

Talk @ seventh international GSO workshop: An Information-theoretic Optimality Principle for the Formation of Abstractions

2014-12-16T15:00:00+00:00

Genewein T (2014) An Information-theoretic Optimality Principle for the Formation of Abstractions. Seventh international workshop on guided self-organization, Freiburg.

I had the chance to present my recent work on an information-theoretic optimality principle for the formation of abstractions with a talk at the seventh workshop on guided self-organization in Freiburg. In the talk I presented the information-theoretic framework for bounded rationality that trades off large expected utility against low information processing cost (see the Research pages for more information or alternatively this paper).

The central idea of the optimality principle is that any change in behavior incurs computation and in an agent with limited computational capacity this computation is costly. The cost of computation should thus be traded off against achieving a high expected utility. Interestingly, the resulting principle is identical to the rate distortion problem (the information-theoretic framework for lossy compression) and has strong formal ties to free energy minimization in thermodynamics. The problem in lossy compression is essentially the same as when forming abstractions: relevant information must be separated from irrelevant information (noise). The rate distortion principle for bounded rational decision-making has a parameter that governs the trade-off between utility and information processing cost and we could show that changing this parameter leads to the emergence of natural levels of abstraction.
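The effect of the trade-off parameter can be illustrated with a small sketch. It uses the standard Blahut-Arimoto iteration for rate distortion problems, with utility taking the place of negative distortion, which is an assumption about how one might solve such a problem numerically rather than a description of the code behind the talk. The utility table, the parameter name `beta` and its values are hypothetical.

```python
import math


def bounded_rational_policy(utility, p_world, beta, iterations=200):
    """Blahut-Arimoto style iteration for the utility/information trade-off.

    utility[w][a] is the utility of action a in world-state w and beta is
    the trade-off parameter (large beta means cheap information processing).
    """
    n_w, n_a = len(utility), len(utility[0])
    p_action = [1.0 / n_a] * n_a  # marginal over actions
    policy = []
    for _ in range(iterations):
        policy = []
        for w in range(n_w):
            weights = [p_action[a] * math.exp(beta * utility[w][a])
                       for a in range(n_a)]
            total = sum(weights)
            policy.append([x / total for x in weights])
        p_action = [sum(p_world[w] * policy[w][a] for w in range(n_w))
                    for a in range(n_a)]
    return policy, p_action


# Hypothetical task: actions 0 and 1 are perfect for one world-state each,
# action 2 is a generic response that is fairly good in both.
utility = [[1.0, 0.0, 0.8],
           [0.0, 1.0, 0.8]]
p_world = [0.5, 0.5]

policy_low, _ = bounded_rational_policy(utility, p_world, beta=1.0)
policy_high, _ = bounded_rational_policy(utility, p_world, beta=20.0)
print(policy_low)
print(policy_high)
```

With a low `beta` the policy ignores the world-state and always picks the generic action, which is a simple form of abstraction. With a high `beta` it pays off to distinguish the world-states and pick the specific action for each.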

At the end of the talk there is a short outlook on how to extend the basic trade-off to more complex systems - in particular perception-action systems. The results of this extension are published here.

PhD course: Information Geometry in Learning and Optimization

2014-09-26T14:31:51+00:00

I participated in the PhD course on Information Geometry in Learning and Optimization from Sep. 22nd to 26th, hosted at the University of Copenhagen. The course was packed with great speakers, most notably the “father” of information geometry Shun-ichi Amari. The PhD course was very well organized and attracted a great crowd of participants with quite diverse backgrounds from all over the world. Copenhagen was also an incredibly cool place to visit (never seen that many bikes in a bike-lane before), hope to be back one day…

Autonomous Learning Summer School 2014

2014-09-04T14:31:51+00:00

I participated in the 2014 summer school on autonomous learning (Sep. 1st to 4th) at the MPI for Mathematics in the Sciences in Leipzig. The topics of the summer school were

learning representations,
acting to learn (exploration),
learning to act in real-world environments

with participants working across the fields of neuroscience, machine learning and robotics. I also had the chance to present my work with a poster (see download). The summer school was organized very well with many great speakers (including Shun-ichi Amari!). It was also very cool to visit Leipzig and see the MPI for Mathematics in the Sciences.

Paper published: Assessing randomness and complexity in human motion trajectories through analysis of symbolic sequences

2014-03-31T15:00:00+00:00

Peng Z, Genewein T, Braun DA (2014) Assessing randomness and complexity in human motion trajectories through analysis of symbolic sequences. Front. Hum. Neurosci. 8:168. doi: 10.3389/fnhum.2014.00168

In this paper we analyze the complexity of human motion trajectories and investigate whether humans can generate completely random movements. We also tackle the question of how to measure complexity in human motion trajectories. The difficulty is that many common measures of complexity are simply measures of irregularity that are maximized by completely random trajectories. However, completely random trajectories are not considered very complex to human observers. On the other hand, completely predictable (and thus regular) movements are not complex either. Maximally complex trajectories seem to lie somewhere in-between fully predictable and fully unpredictable movements: complex motions have some regularity but also enough variation to be somewhat unpredictable. In the paper we compare classical complexity measures against the effective measure complexity which is a measure that is maximized for sequences that are neither fully predictable nor completely random.

In our experimental study participants were generating motion trajectories by “drawing” with a virtual stylus in a 3D virtual reality setup. We asked participants to either draw letters or invent their own patterns or perform a movement that is as random as possible. We then analyzed the complexity of their movements using classical measures but also the effective measure complexity (EMC). With the classical measures we find that they are maximized for the random trajectories - however the EMC was maximal for the invented patterns, second largest for drawings of letters and lowest for random movements, which is in line with an intuitive judgment of motion complexity. Additionally we found that the random trajectories generated by humans still contained some regularity, implying that the trajectories were not completely random. In a second experiment we trained humans with a pursuit game where an artificial agent tried to predict where participants would move next while drawing a random trajectory. If the prediction was correct an error sound was played back and participants’ goal was to minimize the errors by being as unpredictable as possible. We found that with this training humans could learn to increase the randomness of their motions.

The results of this paper lead to the question of whether aesthetics and human judgment of aesthetics could be related to the effective measure complexity. The hypothesis is that we find patterns that are either fully predictable or fully unpredictable boring, whereas patterns that contain some regularity but also some surprise (in terms of unpredictability) seem to be interesting. An interesting question would then be whether patterns that have a high EMC would also be judged interesting or aesthetically pleasing. However, this question is subject to future investigations.

Paper published: Occam's Razor in sensorimotor learning

2014-03-26T15:00:00+00:00

Genewein T, Braun DA (2014) Occam's Razor in sensorimotor learning. Proceedings of the Royal Society B 281:20132952. doi: 10.1098/rspb.2013.2952

Overview

Bayesian model selection
Bayesian Occam’s razor
Experimental design - marginal likelihood of a Gaussian process
Results

Our paper on sensorimotor regression and the preference of simpler models got published in Proceedings B of the Royal Society.

In the paper we present a sensorimotor regression paradigm that allows for testing whether humans act in line with Occam’s razor and prefer the simpler explanation if two models explain the data equally well.

Bayesian model selection

We modeled human choice behavior with Bayesian model selection. In Bayesian model selection, models $M_1$ and $M_2$ are compared by taking their posterior probability ratio (given the data $D$)

$\underbrace{\frac{P(M_1|D)}{P(M_2|D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D|M_1)}{P(D|M_2)}}_{\text{Bayes factor}} \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}},$

where the model evidence or marginal likelihood is given by marginalizing over the model parameters $\theta$

$P(D|M_i)=\int P(D|\theta,M_i)P(\theta|M_i) d\theta.$

If the ratio of posterior odds is larger than one, model $M_1$ should be preferred - if the posterior odds ratio is smaller than one, the data is in favor of $M_2$. If the strength of preference is ignored, the previous rule leads to a deterministic model selection mechanism. Additionally, the magnitude of the posterior odds ratio is an indicator of the strength of the preference which can be used to derive stochastic model selection mechanisms (for instance by combination with a softmax selection rule). In case of equal prior probabilities for both models, the quantity that governs model selection is the Bayes factor (sometimes called Occam factor), which is the ratio of marginal likelihoods.
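The deterministic and the stochastic selection rule can be sketched in a few lines. The sketch works with log marginal likelihoods for numerical convenience. The inverse temperature `beta` of the softmax and all numbers are assumptions made for illustration, not values fitted in the paper.

```python
import math


def log_posterior_odds(log_evidence_1, log_evidence_2, prior_1=0.5, prior_2=0.5):
    """Log posterior odds = log Bayes factor + log prior odds."""
    return (log_evidence_1 - log_evidence_2) + math.log(prior_1 / prior_2)


def choice_probability_m1(log_odds, beta=1.0):
    """Softmax over the two models, which reduces to a logistic function.

    beta = 0 yields random choice, large beta approaches the deterministic rule.
    """
    return 1.0 / (1.0 + math.exp(-beta * log_odds))


# Illustrative log marginal likelihoods with equal prior probabilities.
log_odds = log_posterior_odds(-10.0, -11.0)

deterministic_choice = "M1" if log_odds > 0.0 else "M2"
p_choose_m1 = choice_probability_m1(log_odds, beta=1.0)
print(deterministic_choice, p_choose_m1)
```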

Bayesian Occam’s razor

It is known that Bayesian model selection embodies Occam’s razor, that is if two models explain the data equally well, prefer the simpler model. Below are two intuitions why this happens - for a more rigorous discussion see Chapter 28 in David MacKay’s Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003).

The marginal likelihood $P(D\vert M_i)$ is obtained by considering the likelihood of the parameters given the data $P(D\vert \theta,M_i)$ under all possible parameter-settings (by taking the integral, see above). A very complex model is highly likely to have a particular parameter-setting that explains the data very well and leads to a large value of the likelihood function (think of polynomial regression with polynomials of a high degree). However, because of the marginalization, all parameters have to be taken into account and a very complex model will also contain many parameter-settings that yield a very bad likelihood. In contrast, an overly simple model will on average explain the data neither very badly nor very well, leading to a low marginal likelihood as well. Only models that are complex enough but not overly complex will get a large marginal likelihood score. Thus Bayesian model selection is intrinsically regularized.
A complex model can explain many data-sets and must thus spread its probability mass over a quite large range (or volume). In contrast, a simple model can only explain few data-sets and has its probability mass more concentrated. If observed data happens to fall in a region where both models can explain the data, the simpler model is very likely to have more probability mass in that region and will thus be favored.

Experimental design - marginal likelihood of a Gaussian process

In our experiment we test whether humans follow Occam’s razor and prefer the simpler model if two models explained the data equally well. We translated the classical regression problem of finding a curve underlying noisy observations into a sensorimotor task: in our experiment participants would see some dots that represent noisy observations of an underlying curve. Their task was to draw their best guess of the underlying curve. In training trials, after participants had drawn their curve, they were shown the true underlying curve that generated the noisy observations. Importantly, the underlying curve could only be generated by one of two models - a simple model that leads to smooth curves and a complex model that leads to more wiggly curves. In training trials, participants were informed at the start of the trial about the generative model with a color cue. In test trials we showed participants the same stimulus (noisy observations) but with a neutral color cue that did not indicate the underlying model. We asked participants to draw their best guess of the underlying function, informing them that the generative model could only be one of the two models experienced previously. From their drawing we could infer their model choice and could thus test human behavior against theoretical choice behavior modeled with Bayesian model selection.

The two underlying models that generated the curves were Gaussian processes (GPs) with a squared-exponential kernel with either a long or a short length-scale (corresponding to a simple or complex model respectively). We found Gaussian processes very suitable for generating natural trajectories (which is much more difficult with polynomials for instance) but perhaps most importantly we could use the Gaussian processes to generate trials where the noisy observations can be fitted with both models equally well. Inspecting the (closed-form) analytical expression for the marginal log likelihood of the Gaussian process reveals that it is composed of a data-fit-error term and a data-independent complexity term

$\log P(y|X,M_\lambda) = -\underbrace{\frac{1}{2} y^\text{T} K_\lambda^{-1} y}_{\text{data-fit-error}} - \underbrace{\frac{1}{2}\log |K_\lambda| - \frac{N}{2}\log 2\pi}_{\text{model complexity}},$

where $y$ denotes (noisy) observations at locations $X$, $M_{\lambda}$ indicates the models with different length-scales $\lambda$, $K_{\lambda}$ is the covariance matrix of the GP obtained by evaluating the kernel function for all input-pairs $k_{\lambda}(x_i, x_j)$ (which depends on the length-scale $\lambda$) and $N$ is the number of observations.
Importantly, the model complexity term is independent of the observations $y$ (in our case the number of observations $N$ was fixed throughout the whole experiment) and is governed by the length-scale parameter of the different models. The complexity term consists of the log determinant of the covariance matrix and a term that depends on the number of observations. One explanation why this term corresponds to a complexity term is that it is the entropy of the (posterior) GP (since the GP is essentially a multivariate Gaussian) - a large entropy would thus imply a large complexity and vice versa. See the publication for more discussion on the complexity term and also some simulations that confirm that the model with the shorter length-scale also incurs a larger complexity term.
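The two terms can be computed with a small standard-library sketch. It assumes a squared-exponential kernel with unit signal variance and folds a small observation-noise variance into the diagonal of $K_\lambda$. The noise level, the input locations, the observations and the two length-scales are hypothetical and not the values used in the experiment.

```python
import math


def se_kernel_matrix(xs, length_scale, noise_var=0.01):
    """Squared-exponential covariance matrix with noise on the diagonal."""
    n = len(xs)
    return [[math.exp(-0.5 * ((xs[i] - xs[j]) / length_scale) ** 2)
             + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gp_log_marginal_terms(xs, ys, length_scale, noise_var=0.01):
    """Return (data_fit_error, complexity); log evidence = -(sum of both)."""
    low = cholesky(se_kernel_matrix(xs, length_scale, noise_var))
    # Forward substitution solves low z = y, so that y^T K^-1 y = z^T z.
    z = []
    for i in range(len(ys)):
        z.append((ys[i] - sum(low[i][j] * z[j] for j in range(i))) / low[i][i])
    data_fit_error = 0.5 * sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(ys)))
    complexity = 0.5 * log_det + 0.5 * len(ys) * math.log(2.0 * math.pi)
    return data_fit_error, complexity


xs = [0.1 * i for i in range(8)]
ys = [math.sin(3.0 * x) for x in xs]  # made-up observations
fit_long, complexity_long = gp_log_marginal_terms(xs, ys, length_scale=1.0)
fit_short, complexity_short = gp_log_marginal_terms(xs, ys, length_scale=0.1)
print(fit_long, complexity_long)
print(fit_short, complexity_short)
```

Since the complexity term does not depend on `ys`, it can be compared across the two length-scales before any data is observed.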

The analytical expression of the data-fit-error term, which is independent of the model complexity, allowed us to use the GP framework to create stimuli where the noisy observations lead to the same data-fit-error term under both models. We used these trials to test whether humans would prefer the simpler model as an explanation in case both models fit the data equally well. Additionally, we could also generate trials where the data-fit-error term was different under both models and use the marginal likelihood ratio to predict theoretical choice probabilities according to Bayesian model selection (paired with a softmax selection rule).

Results

In equal-error trials where the data-fit-error term was the same under both models, we found that humans followed Occam’s razor and strongly preferred the simpler model. In general trials with arbitrary data-fit-error terms we found participants’ behavior to be quantitatively consistent with the theoretical model of Bayesian model selection combined with a softmax model selection mechanism.

Additionally we performed a control experiment to rule out that the preference for the simpler model was simply due to a preference for lower physical effort (as smooth trajectories require less effort to draw) by having participants indicate their model choice with a mouse-click and then drawing a trajectory automatically according to their model choice. We found that the physical effort of drawing had no impact on model choice in our task.
We also performed a control experiment to rule out that the preference was not due to model complexity but simply due to smoothness of trajectories. We designed the two GP models in a way such that the wiggly model became the simple model. To achieve this we used a wiggly mean-function in the short length-scale model but then tuned the parameters of the GP such that the trajectories generated by the model are very similar to the mean and have very little variation. This led to a wiggly generative model that produced functions of high spatial frequency but with very little variation (simple model, because it only explains a very narrow range of observations) and a smooth generative model that produced a large variety of functions (complex model, even though the generated functions have low spatial frequency). See the publication for more details.