### Tim Genewein

Information-optimal hierarchies for inference and decision-making

DeepMind
London, UK

Email Scholar Github Contact Subscribe

# Distribution of information processing - hierarchical information processing

### Overview

This article extends the rate-distortion principle for bounded rational decision-making presented in this research article, make sure to read it first if you’re unfamiliar with the information-theoretic framework for bounded rationality.

In this article, the information-theoretic framework for bounded rationality is used to model distributed information processing, where the total information processing load is split up onto two processing stages. These two stages can be interpreted as two levels of an information processing hierarchy, where the high-level decision narrows down the search space for the low-level decision.

### Processing information for acting

In the rate-distortion case, a bounded-rational decision maker is presented with world-states $w$ and computes actions $a$ (or a posterior policy $p(a \vert w)$) in order to maximize the (expected) utility $U(w,a)$. The computational resources of the agent are bounded by an upper limit on the mutual information:

In the serial action-perception case, the basic setup is extended by a perceptual variable $X$, which leads to a serial chain of two processing stages: a perceptual stage and an action-stage. However, in this setup, the overall information processed is upper-bounded by the capacity of either stage, meaning that the overall information processed cannot be larger than either of the information throughputs of the two serial stages:

or in other words: if all we care about is maximizing utility, then the serial case requires to pay (at least) twice for the same amount of information processing compared to the rate-distortion case that consists of a single processing stage only.

### Distribution of information processing

The goal in a parallel hierarchy is to split the total information processing load onto several elements of limited capacity such that in the end the information processed is larger than the information processed by each element (or processing stage). The approach taken here also introduces an “intermediate” random variable $M$, but in contrast to the serial case, the joint-distribution factorizes as follows:

This opens up the possibility for a distribution of information processing load on two stages, a high-level and a low-level stage:

The reason why the first stage is referred to as “high-level” is because it corresponds to a high-level decision that narrows down the search space over actions for the low-level decision of the second stage (more on this further below).
Crucially, the distribution of information processing also holds when considering the mutual information between $W$ and $A$, which is the amount of information processing that effectively contributes towards maximizing expected utility:

### Hierarchical decision-making: introducing models

In the following we refer to the variable $M$ as the model and use it to induce (prior) distributions over the action $A$ according to $p(a \vert m)$. The (low-level) decision-maker uses these prior distributions in addition with information about $w$ to compute a (bounded) optimal response $a$. Each model can reduce the effective search-space over actions - the difficulty then lies in choosing the correct model, given a world-state $w$, but also in finding good distributions $p(a \vert m)$. The following three components must be specified:

The low-level decision-maker uses the distribution induced by the model as a prior and computes an optimal action in a bounded rational fashion. Following the information-theoretic principle for bounded rationality, the low-level stage can easily be written down as:

The design of the model-selector $p(m \vert w)$ and good model priors $p(a \vert m)$ is crucial for the efficiency of the whole system and is in general a non-trivial task. In the section below it is shown how the consequent application of trading off expected utility against computational cost leads to a well-defined, bounded-optimal solution for both problems.
Note that the model-selector can be considered as another, high-level, decision-maker that narrows down the search-space for the low-level decision-maker - the overall system can thus be considered a simple two-level decision-making hierarchy.

### Information-theoretic bounded rationality

Consequent application of the information-theoretic framework for bounded rationality implies that the expected utility must be traded off against the cost of computation - in this case, computation is required to determine the correct model and another computational step is required to find an action given the prior induced by the model. Formally, the following optimization problem needs to be solved:

### Bounded-optimal solution

The objective above, can be solved in closed-form, leading to a set of self-consistent equations:

where $p(w \vert m)$ is given by Bayes’ rule $p(w \vert m)=\frac{1}{p(w)}p^*(m \vert w)p(m)$.
$\Delta F(w,x)$ is the free energy (difference) of the low-level stage

and acts as a utility for the high-level stage, that is the high-level stage maximizes the free energy trade-off of the low-level stage in a bounded rational fashion (since it appears in the exponential term of the solution for the high-level model selector). Additionally, $p(m)=\sum_w p(w)p^*(m \vert w)$.

### Information-optimal hierarchies

The solution to the optimization problem fully specifies all three components of the two-level hierarchy:

• a model selector $p^*(m \vert w)$
• model priors $p^*(a \vert m)$
• the low-level decision-maker $p^*(a \vert w,m)$

Note that the bounded-optimal low-level decision-maker turns out to be exactly the one that was intuitively motivated earlier (in the paragraph where the model variable was introduced).

Also note that the structure of the hierarchy is governed entirely by the utility-function, the distribution over world-states $p(w)$ and the computational limitations of the two stages (controlled by $\beta_1, \beta_2$ ). The hierarchy thus emerges from the utility-function and the computational limitations. Importantly, it is most economic to put the most re-usable computation into the high-level stage (by specifying “good” model priors), therefore models induced by the high-level stage can be regarded as abstractions (narrowing down the search space over more specific actions) and the whole hierarchy becomes a simple hierarchy of abstractions.

To fully understand the resulting hierarchical model, please refer to the more detailed discussion in this paper - it explains why the inverse temperatures must be set in a particular way ($\beta_1 > \beta_2$), why it makes sense to limit the cardinality of $M$ and why this will lead to the most re-usable and thus most abstract information processing happening in the high level of the hierarchy (thus leading to a hierarchy of abstractions).
Additionally the paper illustrates the bounded-optimal hierarchy with an intuitive example, which can also be found in this Jupyter notebook (or alternatively can be viewed in the browser here).