## Abstract

Despite many debates in the first half of the twentieth century, it is now largely a truism that humans and other animals build models of their environments and use them for prediction and control. However, model-based (MB) reasoning presents severe computational challenges. Alternative, computationally simpler, model-free (MF) schemes have been suggested in the reinforcement learning literature, and have afforded influential accounts of behavioural and neural data. Here, we study the realization of MB calculations, and the ways that this might be woven together with MF values and evaluation methods. There are as yet mostly only hints in the literature as to the resulting tapestry, so we offer more preview than review.

## 1. Introduction

Animals and humans often occupy environments in which there is a series of distinct states which are linked by possibly stochastic temporal dynamics or transitions. In turn, these states are associated, again possibly stochastically, with outcomes that can be appetitive or aversive. In order to choose actions to obtain the appetitive outcomes and avoid the aversive ones, it is typically necessary to make predictions about the net utility of the long-run delivery of these outcomes, given a particular candidate action.

There are widely believed to be at least two structurally different methods for making such predictions. In the context of reinforcement learning, they are referred to as model-free (MF) and model-based (MB) reasoning [1,2], with the former being retrospective and the latter being prospective. Other fields that make somewhat similar distinctions use different labels, such as habitual and goal-directed [3–5]; not to mention yet more distant dichotomies such as type I and type II reasoning [6,7].

The MB prediction of long-run future utility starting from a state *x* is based explicitly on (i) a sum over the whole set (typically in the form of an expanding tree) of future states that the subject's (typically learned) model tells it to expect to encounter following that state and (ii) the outcomes that the model tells it to expect to receive at those potential future states, each endowed with a utility that is assessed by a second aspect of the model. By contrast, the MF prediction depends only implicitly on these quantities. It is underpinned by the observation that since long-run utility is at stake, the prediction made at state *x* should be consistent with the equivalent predictions made at the neighbouring states which the subject visits following *x* [2,8]. On average, the only difference should be the utility provided at *x* itself. Thus, the MF prediction arises from learning from past encounters of state *x* to minimize untoward inconsistencies. Even more extreme MF systems directly couple observations to decisions, for instance using the inconsistencies to learn appropriate actions or action sequences [9–11].

The differences between MB and MF learning matter. First, they are known to have quite disparate computational and statistical properties [1,4]. MF predictions are computationally simple because they typically depend on little more than a feed-forward mapping of state to predicted future utility. However, they are statistically infelicitous, because adjusting long-run predictions by reducing local inconsistencies between neighbouring states' predictions fails to take account at every juncture of all the information that is known about the values of more distant states. Instead, this approach, known as bootstrapping, uses the value currently predicted at *x*'s successor state as a stand-in for the value of all rewards to follow. This biases the estimate of *x*, particularly at the onset of learning when all the values are poorly known. Conversely, MB methods are computationally ruinous, because there are usually very many possible future states in a long series of transitions. However, these methods are statistically straightforward, because learning can be temporally completely local.

Second, MB and MF methods have quite distinct psychological properties, which underpin a venerable debate about their relative importance [4,5,12–16]. Consider what happens when the environment changes, so that either the utility associated with a particular outcome or some state transition is different from its value during learning. The former could happen if the animal learned when hungry, but is now thirsty; the latter if a path in a maze is newly blocked or opened. Such adjustments can affect the utility that should be predicted at many states antecedent to where the change occurred. Because MF methods are based on the local reduction of prediction errors, there will be no update to predictions made at states far from the change until paths collectively leading there have actually been experienced. By contrast, MB predictions at distant states can alter straightaway given observation of just the changes, because these predictions are constructed on the fly using current information. In normal human subjects, it seems that choice can depend on both MB and MF methods, with the balance being tipped to the latter by a number of manipulations such as cognitively demanding dual tasks [17] or stress [18].

These rather clear psychological distinctions have, at least to a degree, helped elucidate the neural circuitry involved in MB and MF reasoning. Unit recordings and other studies have suggested that appetitive MF learning is supported by prediction errors carried by dopaminergic neurons in the midbrain, driving plasticity at their targets in striatum and elsewhere [19–21]. Meanwhile, lesion-based studies in rodents suggest that there is also a particular role for the dorsolateral striatum in MF control and possibly the central nucleus of the amygdala in prediction [22–27]. Equally, ventral prefrontal areas, the dorsomedial striatum, and possibly the basolateral nucleus of the amygdala are implicated in MB prediction and control [22,23,27]. These localizations have also been partly verified in a set of neuroimaging experiments in humans [28–31].

However, although these studies, and a wealth of others, are revealing as to the regions of the brain that are involved, much less is known about the actual mechanisms supporting MB control in the brain. To a great extent, this is because these experiments predominantly turn on simply detecting MB evaluation (e.g. testing whether choices or neural signals adjust appropriately when the environment changes) but tend, as a group, not to address the within-trial processes by which such computations occur. Such an experimental approach is well matched to theories of MB planning in psychology, which tend to be at Marr's computational level. A more detailed, algorithmic or process-level account might serve as a framework for interpreting the neural substrate. In reviewing different computational approaches to MB evaluation, this review aims to lay out a set of possible directions for filling in these gaps.

In so doing, and in contrast to existing psychological models (and to some extent, experimental tasks) that consider MB computation in small and tractable state spaces, we focus on the problems that arise in more realistic domains when the number of states or trajectories is large [4,32–34]. Then, it is not usually possible to perform a full MB calculation, i.e. to enumerate the full tree of future states and evaluate simulated net utilities. This motivates a search for alternatives. One prominent example is the substitution of MF for MB estimates at parts of the tree, thus obviating the requirement for full MB search. Doing this judiciously is a substantial meta-control problem [35–38] that we do not yet know how to solve. However, such interactions, if understood more systematically, might provide us with a basis for understanding various puzzling interactions between MB and MF systems that appear to be suggested by recent neuroscientific and behavioural data [39–43].

Here, we consider MB prediction from various algorithmic angles, discussing key methods and problems. We do not intend to be computationally, psychologically or neurally comprehensive, but instead mostly point to some possibilities that merit further examination. In §2, we define the problem for MB prediction in a Markov domain. In §3, we consider shortcuts of various sorts and their potential origin in learning. Finally, in §5, we touch on some of the neural implications and relationships for these notions.

## 2. Model-based prediction

To begin, we lay out the problem of predicting long-run utility using a model. To streamline the discussion, we adopt two formal simplifications. First, much of the recent literature on MB methods has concerned control—i.e. the choice of trajectories of actions in order to optimize long-run utility [2]. However, many of the critical issues are raised by the simpler problem of *prediction* of the long-run utility, and so we initially focus on this. Second, following much work in reinforcement learning (RL), we assume that the problem is expressed in terms of states *x* that satisfy the Markov property that future states and rewards are conditionally independent of past ones, given the current state. In problems where this fails to hold for the most obvious definition of state, a generic manoeuvre is to define a (larger) space of auxiliary states that *are* Markovian. For instance, a hidden Markov model or a partially observable Markov decision process involves a latent or hidden state whose rendition in the actual observations is partial or confounded. However, although the observations therefore fail to satisfy the Markov property, the belief state, which is the posterior distribution over the latent process state given the observations, does [44]. In this way, the basic framework described below also applies to these settings.

Thus, consider an uncontrolled Markov chain over states with transition structure and expected reward vector . For convenience, we will consider deterministic rewards; the methods can be readily extended to the stochastic case. We will assume that the rewards are bounded.

We take the goal as being to estimate the discounted, long-run future expected reward starting from each state *x*
2.1
where 0 ≤ γ < 1 is the discount factor that downweights distant compared with proximal rewards, and the expectation is taken over the stochasticity of the transitions. We can explicitly expand the expectation as
2.2
and then, noting that can be expressed using the *t*th power of the transition matrix, we can derive two simpler forms
2.3
and
2.4
where expression (2.4) is a form of consistency condition between the values of successive states. By constructing vectors and , we can write equation (2.3) as
2.5
and noting the matrix identity that we can write this as
2.6
The vector form of equation (2.4) is
2.7

We sometimes write the true value of **v** as . Assume now that we have estimates and of and **r**. What should the estimate be of **v _{*}**, and how can we actually perform the calculations concerned? Although there are other options, one obvious estimate is This is called the certainty-equivalent value, because it comes from substituting the estimated values into equation (2.6) as if they were true. In terms of calculations, the predictions then also satisfy two simple relationships:
2.8
which is the approximated vector form of equation (2.5)
2.9
which is the approximated vector form of equation (2.9). These relationships underpin MB and MF algorithms.

### (a) Enumeration and roll-outs

In order to compute values from an estimated model and , equation (2.8) invites us to perform either a deterministic calculation or stochastic ‘roll-outs’. The deterministic calculation would realize the steps of the addition in the equation, up to a finite horizon *s* by which point is sufficiently small relative to the maximal reward. However, this sort of evaluation mechanism is only credible in the case that there are very few states.

In the more general case of many, or even an infinite number of, states, it is possible to use a stochastic sampling method to focus an approximation around any state *x* [45]. One option is the simple Monte Carlo scheme of sampling trajectories from , along with sample rewards , and then averaging the discounted sum of encountered utilities across the samples. One distinct advantage of this is that it suffices to have a ‘black box’ simulator—thus even if there is an infinite number of states, so that writing down is impossible, it could still be viable to generate sample transitions.

There are two sources of error in such a scheme. The first is the obvious one of only using a finite number of samples; this implies that the average will be corrupted by noise whose standard deviation will decrease with the inverse square root of the number of samples. The second source of error comes from the finite truncation of the infinite horizon.

Because this sort of stochastic evaluation is focused on a particular starting state *x*, it is not guaranteed to provide the information necessary to evaluate any other state. The latter is not an accident—it is actually a fundamental aspect of this method (called sparse-sampling; [45]). It is, perhaps surprisingly, possible to evaluate expectations well without enumerating anything like all the states.

### (b) Consistency

The alternative to enumeration, calculation and roll-outs is to consider equation (2.9) as suggesting a consistency condition that applies between values of and when it is possible to make a transition from *x* to *y* in one step. Inconsistency, i.e. when the two sides of the equation are not equal, can be a signature for learning.

One standard technique is Bellman evaluation, which consists of starting from any , and then performing the iteration which comes directly from the consistency condition of equation (2.9) 2.10

Because is a stochastic matrix, it is possible to show that this is a contraction (in an norm), in that 2.11 (taking the maximum over the elements of the vectors) which implies that converges exponentially quickly to the correct answer [46].

Note that for , this method closely corresponds to the enumeration scheme from equation (2.8), with each iteration corresponding to one step of the addition. Apart from skirting explicit computation of powers of the transition matrix, the recursive evaluation here suggests a different possibility for sample-based evaluation (based on single transitions rather than on full trajectories), described next. In addition, the possibility of initializing the iteration with a set of values other than zero suggests many possibilities for using MB evaluation to refine pre-existing estimates, discussed below in §3.

Replacing by is usually called performing a *backup*, in that information from the *future* of state *x* is used to change the estimated value of state *x* itself. Of course, until is itself correct, the information that is being backed up is not completely faithful. It is in this sense that enforcing consistency is known as a bootstrapping technique.

One could also imagine simply visiting state *x*°, then generating a sample state *y*° from and using as a sample backup. In this case, it is necessary to average over multiple samples to overcome the inherent stochasticity, and so to perform a partial backup
2.12
Here, is a learning rate; the term it multiplies is called the temporal difference (TD) prediction error [47]. Provided each state is visited infinitely often, and some other technical conditions are satisfied, this scheme will converge appropriately: . We discuss the MF use of this rule in §3a.

### (c) Roll-outs and consistency

It is possible to combine rollouts and consistency, as in the original sparse-sampling algorithm [45] and varieties of Monte Carlo tree search techniques [48]. In a variant that was really intended for control rather than just for prediction, one idea is to build a tree-structured representation of the value of a subset of states progressively, one node (representing one state's value) at a time. Evaluation begins at the root of the tree, which is the current state. Sample transitions move the state down the tree (with sample rewards also being noted). Any time a leaf of the current tree is reached, a (curtailed) roll-out is performed without building any more tree structure; the rewards and values found are backed up in the existing tree along the sample path (using a version of equation (2.12)). Then, a new node is added to the tree at the leaf that has just been evaluated, and a new evaluation commences from the root. Under suitable assumptions about the Markov decision process (MDP), it is possible to bound the expected error following a certain number of roll-outs.

There are two main advantages of making the tree explicit and using back-ups, and one main disadvantage. One advantage is that if the actual transition in the world follows one of the paths in the tree, as is likely given sufficient tree-based evaluation, then less work needs to be done in the next iteration, because part of the tree will remain relevant. A second stems from reducing the variance, because the values at intermediate nodes in the tree that are used as part of the back-ups, have themselves benefited from being averages over a set of stochastic samples on previous trials. The disadvantage is that it is necessary to represent the tree, which can be expensive, at very least in terms of memory.

### (d) Discussion

The methods discussed in this section use more or less expensive calculations to make predictions prospectively based on and . Such predictions (and similarly decisions, in the case of control) conform to the key signature of goal-directed behaviour [5] that they immediately reflect changes in either quantity. In particular, consider a new reward vector (e.g. induced by a change from hunger to satiety) and/or a new transition matrix (representing different contingencies, such as a rearranged maze). If a subject were to learn about such changes, then MB computations based on the new model would (correctly) change the predicted values straightaway to the new implied values . This change is missed by conventional MF methods.

It is important to be able to quantify the uncertainty about values such as —in this case depending on a given amount of enumeration, roll-outs or samples. Although doing so precisely is extremely demanding in terms both of prior knowledge about the domain and computation, it is often possible to bound the uncertainty, given relatively coarse extra information such as the range of possible rewards and probabilities. In principle, one would like to exploit this information to target future computation more precisely, as a form of what is known as meta-control [35,37,38]. We discuss this further in §4.

## 3. Ameliorating model-based methods

The trouble with the MB evaluation algorithms discussed in §2 is that they pose severe challenges to computation (by requiring multiple iterations or roll-outs) and in some cases additionally to memory (through the size of the tree). This motivates an examination of alternatives that use estimates or approximations to simplify or replace the MB calculations. Two main suggestions have attracted substantial attention: (i) making and storing direct estimates of the long-run utilities of states that can substitute for the complexities of calculation, and (ii) employing forms of temporal abstraction.

### (a) Direct value estimates

It was an early triumphs of artificial intelligence to show that it is possible to acquire estimates of the *endpoint* (or even of ) of MB calculation directly from experience of action transitions and rewards without building or searching a model (hence the MF moniker). This involves enforcing consistency between estimates made at successive states, an idea from Samuel [8] that led to the TD learning rule of Sutton [47] mentioned in §2b. It can be seen as substituting actual transitions and utilities sampled from the environment for the simulated experience that we discussed as underpinning various forms of MB evaluation.

That is, the accumulated, discounted sequence of rewards received following the visit to some state is a sample of its value, and, like simulated roll-outs, these samples can be averaged over state visits. In addition, exactly analogous to the sample backups of equation (2.12), a prediction error can be computed from any observed state-reward-state progression, and used to update the estimated value of the earlier of the states. This is the TD method [47] for online value estimation. A variant called TD(*λ*) interpolates between the two extremes of estimating a state's value in terms of the full sequence of rewards that follows it, versus bootstrapping based only on consistency with the immediate reward and the value of the next state. These sampling methods clearly have the same convergence guarantees as the corresponding MB sample methods.

Given that we can produce estimates of values, what can we do with them? Most obviously, MF quantities can replace MB ones as estimates (or as their policy-maximizing counterparts for control). Perhaps the more interesting question from our perspective here, however, is whether they can be used in conjunction with the MB schemes in §2, to improve on the performance of either alone [49]. The main idea is that they can provide a starting point for iterative improvement via further MB computation. Thus, for instance, in the recursive Bellman backup scheme of equation (2.10), values can be initialized to , potentially improving the bootstrapping backups and ultimately speeding convergence, if these starting points are close to . Similarly, the iterated sum from equation (2.8) can proceed progressively through a series of terms, but then instead of truncating the sum, terminate with the approximate values . Generally, in tree traversal, estimated values can be substituted so as to treat a branch like a leaf. Finally, in just the same way, MB methods using local backups, sample roll-outs or transitions can be applied locally to improve the existing estimates at any given state *x*.

Such a combination of MF estimates with MB refinement is motivated by the fact that estimating from MF experience is so closely related to some of the methods we discussed for computing from a model. That both involve accumulating samples, though from different sources, makes it seem natural more freely to intermix both sorts of samples, experiential and model-generated, in updating the same vector. This is the idea behind the Dyna architecture [50], and since has been refined to take better advantage of search trees [51,52]. These considerations suggest that it may make sense for MF learned values and model-derived computed values to share a single value store, which is both updated online from experience and subjected to refinement with MB methods. This store can be seen as caching the conclusions from experience in the world, expensive MB evaluations, and also ersatz experience generated from a model.

Viewed this way, the question becomes whether, and at what states, it is worth spending the computational and memory costs of MB computation to refine pre-existing value estimates, and, conversely, whether learning directly will permit computational savings relative to purely MB computation, and at what cost. We consider aspects of this in §3c.

### (b) Abstraction

A second approach to ameliorating the problems of evaluating long-run returns involves temporal abstraction. It can apply to MB and MF evaluation; we describe it first for the latter in the context of feature-based representations of states.

In §3a we considered a separate MF prediction for each state *x*. However, it is more common to represent the states using a set of features, and then to realize as a linear function of the features. In these representational schemes, if the value of the *y*th feature at state *x* is *X _{xy}*, then we write , where are the estimating weights (that then become the target of TD learning). Collectively, this makes
3.1

Consider using as a representation , an estimate of [53,54]. Comparing equations (2.6) and (3.1), it is apparent that if the weights were just the immediate rewards, then MF predictions would be . To put it another way, in matrix , feature *y* represents the discounted future occupancy of state *y* starting from each initial state. This is therefore called the *successor matrix.* The value of state *x* is a sum over such future occupancies, each weighted by the immediate reward *r _{y}* available at the states involved.

It is straightforward to learn estimates of the immediate rewards. The successor matrix itself can be estimated via sampling methods that are exactly analogous to those for estimating directly [53]. Recall that each row of the matrix represents the expected, discounted cumulative occupancy over all other states, following visit to some state *x*. Conversely, each column counts future (discounted) visits to some successor state *y*, as a function of different start states. A column of is thus equivalent to a value function for a Markov process in which unit reward *r* = 1 is earned for each visit to state *y*, and zero otherwise, defined by a Bellman equation analogous to equation (2.10). Thus, in turn, each column of can be learned via either of the sampling methods in §2 (trajectory or transition based, or in general, TD(*λ*)), where visits to state *y* are counted in place of rewards. In general, TD learning of requires updating an entire row of the matrix following the transition from some state *x* to *y*, according to the observed transition (1 for *y*, 0 elsewhere) plus a vector-valued backup of the discounted occupancies over all other states expected following *y*, i.e. .

The successor matrix is not only useful for MF evaluation. Because it summarizes and thus replaces the trees, roll-outs or iterations over state visits that are required to produce the MB computations based on equations (2.8) or (2.9), it can save almost all the computation associated with the implied algorithms. All that would formally be required is an MB estimate of the utility of the immediate reward at each state—a quantity that is in any case part of the model.

The way the successor representation aggregates future state occupancies is an extreme example of temporal abstraction. Other related ideas include multiscale temporal modelling [54] (itself generalized in [55]), which involves learning a world model corresponding to multiple steps of the Markov transition dynamics. This comprises powers of the transition matrix (e.g. for two steps) and the associated sums of rewards . These constitute Markov chains, as do arbitrary mixtures of them; each represents views of the same process at different timescales. In the context of MB evaluation, coarser timescale models can allow for deeper search, in effect by aggregating multiple steps of the world model together. They can be learned in a similar manner to the successor matrix.

Similarly, rather than entire transition matrices, individual sets of state transitions can also be aggregated. This arises mainly in the control case, where the set of one-step actions can be augmented by aggregate actions, known as options, which constitute an extended policy [56,57]. Again, an option includes both transition and reward models that aggregate the consequences of the extended action sequence. Because these pre-compute a chunk of the tree of future states (and potentially actions), they can be used to take large steps during ‘saltatory’ MB evaluation [56]. However, whereas the consequences of following a particular temporally extended option (in terms of end state and rewards received) are also easy to learn by essentially the same methods as discussed so far, the larger problem of option discovery, i.e. finding a useful set of temporally extended actions, is an area of substantial current research [11,56].

### (c) Trade-offs

All these MB and MF methods enjoy asymptotic optimality guarantees in terms of computation or experience or both. However, trade-offs arise pre-asymptotically. Importantly, error, relative to the true values , has two components. One is due to limitations from evidence, learning or representation: even when computed exactly, will generally not coincide with the true because the former is based on an estimated model and , for instance, itself learned from limited experience with sample transitions and rewards. Such error might be viewed as irreducible, at least at a given stage of learning. Note that if the state space is large or continuous, computations expressed entirely in terms of either MF values or the successor matrix may have different, and potentially favourable, properties, compared with working with models which would have only to be approximate [46,58,59].

However, value estimates may differ from further owing to computational insufficiencies: either from approximate or inadequate computation in the MB case (e.g. truncating sums, averaging few samples), or, in the MF case, from noise inherent in the way that and were themselves computed from experience during learning. In particular, as we have seen, typical methods for learning values or the successor matrix themselves centre on bootstrapping these quantities by sample-based updates analogous to equation (2.12). Such bootstrapping during learning can fall short, for a given amount of evidence, from reaching the final values that could have been obtained had a model been learned from the same evidence, and then enumerated or rolled out. In this regime, further refinement of the MF values using MB methods can improve the estimates.

A source of inaccuracy in MF estimates such as and that has been of particular interest in a neuroscience setting comes from the case of change in the utilities or transitions. Going back at least to Tolman [16], these manipulations are designed to allow the subject to update the world model (either or ) while not giving them experience that would allow standard learning rules to update MF values or, in some experiments, the accumulated transition information inside . Under such an experiment, the MF values are biased towards estimates of the old values **v** instead of the new values . Additional computation with the (correct) model could correct this bias.

In particular, insofar as these techniques blend MB evaluation with pre-computed steps, they might or might not pass as MB in the standard laboratory tests. For instance, predictions based on the successor matrix could adjust to new reward contingencies immediately, if the estimates of **r** in are replaced by **r**‘. This would not be the case for weights learned to an arbitrary feature matrix . Furthermore, it would apply only to manipulations of reward values; changes in the transition contingencies that would alter would still lead to errors if the successor matrix were not itself relearned. Thus, behaviour produced by the successor representation is not truly goal-directed, under the strict definition [60] of being sensitive to both the action–outcome contingency and the outcome value : it would pass laboratory tests for the second, but not the first, owing to the way summarizes the aggregated long-run transitions. Similarly, whether values computed from multi-timescale models or options are able to immediately adjust to reward or transition changes would depend on whether or not these reflect rewards or transitions that are already cached inside their pre-computed aggregates (or, potentially, whether or not the way that the cached quantities might have changed since they were learned can be predicted or is known). The fact that temporally extended options, for instance, contain a model of their aggregate reward consequences, is the basis for a suggestion [10,11] that such action chunking (rather than classical MF value learning) might give rise to the experimental phenomena of habits.

As for whether, or at what states, to target additional computation, one view is that this depends on assessing the uncertainty and/or bias in the existing value estimates, which generally can be done only approximately [61–63]. Locally, this involves tracking uncertainty about the rewards or transitions both by taking account of what has been observed about them and how they are likely to have changed. The challenge here is that the values at different states are coupled to one another through the transitions, so localized uncertainty or change in the model at any particular state has effects on the value (and uncertainty about the value) at many other states antecedent to it via correlations that it is too expensive to maintain. Heuristics, like prioritized sweeping [64], target MB updates towards states likely to have been affected by recently learned examples.

In theorizing about biological RL [4,32,33], information about uncertainty has been proposed to explain, for example, why in experiments, animals favour MB computation early in training (when bootstrapped values are uncertain and there is likely value to MB computation), but MF values dominate following over-training on a stable task (when there is little uncertainty in any case) [65].

### (d) Control

As mentioned in §2, we have so far concentrated on prediction rather than control. Control is typically modelled by a Markov decision process in which the agent chooses an action at each state and the successor state depends jointly on the state and action. The critical extra complexity of control is that the *policy* (which is the mapping from states to possible actions) of the agent has to be optimized during learning. Applying any particular policy reduces a Markov decision process back to a Markov chain, by inducing a policy-dependent state–state transition matrix of the sort considered in §2. This implies that the values **v** (and also the state–action version of the values, known as values; [66]) both depend on the policy—more directly, because the value of a state depends in part on the actions taken at all subsequent states. An optimal policy is defined as one that jointly maximizes value at all the states.

The requirement for control leads to the extra problem of exploration versus exploitation [35,48,67]. That is, to the extent that uncertainty can be measured, the value of reducing it can be quantified, again at least approximately, in terms of obtained reward. This is because, in the case of control (i.e. if the values are actually used to make choices), more accurate predictions can earn more reward via producing better choices, though only to the extent the value estimates were unreliable in the first place. In this way, improving uncertain estimates can justify choosing actions that may reveal new information. The same trade-off also governs to what extent it is worth spending computational resources refining existing value estimates via MB evaluation.

The relationship between a Markov decision process and a Markov chain implies that most of the algorithmic methods can be adapted to control. For instance, there is a maximizing variant of the consistency condition in equation (2.9) that leads to a maximizing variant of the MB tree backup [68] and of MF value learning [66], both of which produce the value function associated with the optimal policy. While it is possible to define an analogous optimal successor representation (the successor matrix induced by the optimal policy), this cannot be learned in a simple MF way, separate from the optimal values. Instead, the successor representation and multi-timescale models fit more naturally with a different family of control approaches, which include policy iteration in the MB case and the actor–critic for MF learning [9,59]. These work by adopting some candidate policy, learning its associated value function and using these values to improve the policy.

The issue of policy dependence also plays an important role in Monte Carlo tree search, because the tree being rolled out depends on the assumed policy, which is also what the tree search is optimizing. There is a Monte Carlo tree search mechanism that tries out different actions according to a measure of their likely worth, but taking the potential benefits of exploration into account [48]. It distinguishes between an (approximately) optimal policy that is being captured by the parts of the tree that have already been built, and a roll-out policy which is a default that is applied at so-far-unexpanded leaves of the tree, and which is almost always suboptimal but provably sufficient for this purpose. For the case of prediction alone the roll-out policy is, in effect, what it is sought to evaluate, and so the prediction tree could have been built more aggressively, adding more than just a single node per step at the root.

With respect to MF methods, we have seen that caching quantities that depend on the rewards and transitions inside a value function , the successor matrix or other temporally extended representations can cause characteristic errors in prediction when the model changes. The fact that these quantities are both also policy-dependent (and also that the policy itself is cached in methods like the actor–critic) is another similar source of error. Thus, if the policy changes at a state, this should typically induce value and policy changes at other states. However, MF methods may not update these all consistently. Notably, although in the prediction case we suggested that the successor representation behaves like MB reinforcement learning with respect to changes in the rewards **r**, though not the contingencies , in the full control case, the successor representation may also fail to adjust properly to some changes in rewards alone. This is because in the control case, changes in rewards **r** can induce changes in the optimal policy, which in turn affect the effective transition matrix, , making it different from the one cached inside .

## 4. Neural realizations

We have discussed a number of variants of MB computation, focusing on the case of the prediction of long-run utility. We suggested that in problems of realistic size, MB methods are unlikely to be able to function in isolation, but might instead fall back on MF estimates, at least to some extent. Prediction learning is central to Pavlovian or classical conditioning, for which actions are elicited purely accordance with the predictions rather than because of their contingent effects on the outcomes. However, MB and MF computational issues also arise for the more significant case of instrumental or operant conditioning, in which it is necessary and possible to optimize the choice of actions to maximize rewards and minimize punishments. Indeed, most psychological and neuroscience studies of the two sorts of methods have been conducted in the instrumental case.

Substantial data distinguishing MB and MF instrumental control have come from brain lesion experiments in the face of a reward devaluation probe. Collectively, these appear to demonstrate a reasonably clean double dissociation between two different networks supporting MB and MF control, such that animals reliably behave in one or other fashion given lesions to specific, different, areas. Instrumental MB and MF behaviour appear to be centred on distinct areas of striatum: dorsomedial and dorsolateral, respectively [22,27,69–71]. Areas of striatum are interconnected with associated regions of cortex via topographic ‘loops’ linking the two structures, via additional basal ganglia nuclei and the thalamus [72–74]. At least in the case of MB behaviour, lesions implicate the entire loop, cortical (prelimbic) and thalamic (mediodorsal) regions affecting MB control [23,75,76]. There is also a key role for the orbitofrontal cortex in the flexible assessment of outcome worth that is essential for MB evaluation [24–26,77]. By contrast, cortical and thalamic elements of a hypothetical parallel MF loop through dorsolateral striatum have yet to be demonstrated. Although the results are less clean cut, neuroimaging in humans has revealed potential counterparts to some of these areas [29–31,69,78,79].

A third subregion of striatum, the ventral part (also known as nucleus accumbens) is more closely associated with Pavlovian rather than instrumental aspects of behaviour (e.g. prediction rather than control). It is well connected with parts of the amygdala, and also the orbitofrontal cortex, which are also implicated in Pavlovian behaviour. Importantly, as we discussed, predictions (as embodied in various Pavlovian behaviours) may themselves, in principle, be computed in either an MB or MF fashion. This distinction has not been nearly as carefully worked out behaviourally or computationally in the Pavlovian case [80]. Nevertheless, there does appear to be some behavioural and neural dissociation between MB- and MF-like Pavlovian behaviours, with the shell region of the accumbens, orbitofrontal cortex and the basolateral nucleus of the amygdala supporting the former (such as specific Pavlovian–instrumental transfer and identity unblocking), and the core of the accumbens and the central nucleus of the amygdala supporting the latter (general Pavlovian instrumental transfer) [22,27,81].

Further, the predictions that feed into Pavlovian actions, whether MB or MF, may also affect instrumental actions, though it is not yet clear exactly how or to what extent. Ventral areas of striatum may be implicated in the performance, if not the acquisition, of instrumental behaviour, consistent with an important role of Pavlovian processes in motivation and regulation of behavioural vigour in Pavlovian to instrumental transfer paradigms [5,82–85]. Perhaps for a similar reason, infralimbic cortex (which is more associated with ventral striatum than with dorsolateral) is also involved in habitual instrumental behaviour [23]. In addition, recent results with disconnection lesions suggest that the basolateral and central nuclei of the amygdala, in communication with dorsomedial and dorsolateral striatum, are required for acquiring MB and MF instrumental behaviours, respectively [86,87].

Collectively, these results are evidence for the mutual independence and disjoint neural realizations of these two sorts of valuation and control. This makes it hard to see how both MB and MF behaviours could be supported by a single, shared value store, as in the most literal implementation of Dyna-like schemes that we discussed in §2a. However, the apparent independence under insult does not rule out the possibility that they interact in the intact brain. Indeed, results from human imaging experiments suggest that the reward prediction errors that are thought to drive online MF learning, are themselves sensitive to MB values [14,39]. This may reflect an MB system training or refining MF values, as in some of the various methods discussed in §3a. Behavioural results in humans, from tasks involving the combination of instructed and experienced values [40] and those involving retrospective revaluation [42], may also be consistent with the possibility that the MF system is trained by the MB one.

We have also seen various ways in which MB learning and evaluation can be conducted, via samples, using techniques entirely parallel to the temporal-difference methods most associated with MF learning. There is a fairly well-accepted neural mechanism for performing appetitive MF learning in which the phasic activity of dopamine neurons conveys a TD prediction error to its targets, particularly affecting activity and plasticity at corticostriatal synapses [19–21,88]. It is tempting to suggest that, whether operating on the same synapses (as in Dyna [50]) or on parallel ones in, say, adjacent corticostriatal loops, samples of simulated experience from an MB system could use exactly the same mechanism to update values, leveraging or perhaps replicating, the brain's MF circuitry in the service of MB evaluation. The sample trajectories themselves might be produced by mechanisms such as the replay or pre-play of patterns of hippocampal activity expressing recent or possible paths through an environment [89–92], which is known to be coordinated with cortical and striatal activity [93,94].

We noted in §3c,d that the successor matrix used as a state representation for an MF scheme can adjust immediately to changes in reward contingencies (if predictions are based on rather than on ) though not changes in transitions (which require an alteration to which can only happen in an MF manner through suitable new real or simulated experience). This can be seen as exploiting TD learning to produce an MF method with some of the features of MB control. Other state representations might also afford useful generalizations, sharing the fruits of learning about one set of states with other sets. Unfortunately, we know rather little about the nature and evolution of cortical representations of states that underpin MF (and MB) predictions, although it is widely believed that these representations do capture and represent explicitly statistical regularities and predictabilities in their input [95–97]. It is indeed a popular notion that the cortex builds a sophisticated representation of the environment that should then afford efficient, hierarchical, MB and/or MF control, and indeed appropriate generalization between related tasks and subtasks. How this works is presently quite unclear.

All of these schemes, from Dyna to the successor representation, that repurpose the MF prediction error to serve MB ends predict dopaminergic involvement in both sorts of valuation. It is reasonably well demonstrated that dopamine does indeed affect MF learning [98–100]; however, there is as yet only mixed evidence for the relationship between dopamine and MB learning [31,43,80,101,102].

Finally, we note the new meta-control problem that has arisen from the need to arbitrate and integrate the multiple methods for making predictions and decisions that we have been discussing [36,103]. One approach to this that has been particularly prominent in the case of the control of working memory (in order to create the sort of history representation that turns a partially observable Markov decision problem into the sort of standard Markovian one we have considered here) is to treat the meta-control decisions as being just like external decisions. This would be realized by the same underlying circuitry (albeit potentially different regions of structures such as the striatum and the cortex) [104–110]. Of course, the potential regress to meta–meta-control problems and beyond needs to be avoided. Such approaches are tied to considerations about what makes meta-control choices appropriate or optimal; new ideas are emerging about this, notably the relationship to various cognitive limitations [111,112].

## 5. Discussion

From at least the time of Tolman [16], the advantages of MB reasoning have been clear. It offers subjects statistical efficiency and flexibility, i.e. an ability to deploy their latest knowledge in the service of their current goals. There are limitations in this flexibility. It turns out, for instance, to be very difficult for us (though not necessarily scrub-jays [113]) to make MB predictions about motivational or emotional states that we do not currently inhabit [114]. More significant though, and interpretable as a point of attack from the very outset [115], are its computational demands. The trees of future states (and actions) grow most rapidly in even moderate-sized domains, overwhelming the working memory and calculation capacity of subjects. Hierarchical approaches [116,117] are important, but pose their own challenges of provenance and inference.

We therefore reviewed alternative, MF, methods of reasoning, which operate retrospectively based on experience with utilities. Canonical versions of these are less flexible than MB methods because they rely on explicit experience to change their assessments; their use can thus be distinguished from that of MB systems when behaviour fails to change when aspects of the world change. There are various suggestions for how MF values might come to substitute fully [4,32] or partially [33] for MB values, as in habitization [5]. For instance, the relative certainties of the two systems might be weighed with control favouring the less uncertain [4]. Alternatively, the value of the potential information that could be gained by performing MB calculations could be assessed and weighed against the opportunity and/or cognitive cost of the MB calculation [32,33]. Certainly, there is much current emphasis in human and animal studies on finding ways of elucidating the differential engagement of the systems [12,14,28,31,39,118].

We focused on the methods of MB calculation, and the ways that MF values or MF evaluation methods might be embedded in the MB system and that MB behaviour can arise from purportedly MF systems. We saw very close parallels between various versions of each, for instance between methods that explore the MB tree using stochastic simulations in the model versus methods that learn MF values from sampled experience in the world. Indeed, there are many points on the spectrum between MB and MF that current tasks only dimly illuminate [12].

One possibility invited by these refined interactions between MB and MF systems is to consider MB evaluation at a much finer grain. We can envisage a modest set of operations: such things as creating in working memory a node in a search tree; populating it with an initial MF estimate; sampling one or more steps in the tree using a model; backing value information up in the current tree, possibly improving an MF estimate using the resulting model-influenced (though not necessarily purely MB) prediction error and finally picking an external action. These fine-grain choices lead to various internal (cognitive) or opportunity costs [32,33,119,120], and are plausibly the solution to a meta-control problem (i.e. concerning the approximately optimal control of control) [36,103,112]. This meta-control problem, which itself could have MB and MF elements to its solution, and presumably depends on learning over multiple tasks, is an important focus for future study.

Finally, many precursors of modern ideas about MB and MF planning were early concerns in the first days of the field of artificial intelligence, and grounded a tradition of experimental work in psychology. Some of these ideas underpin the techniques in reinforcement learning that we have been discussing [8,121,122]. Much computational research considered heuristics for tractably building, searching and pruning decision trees, and the use of value functions to assess intermediate position [123–126], to name but a few. In psychology, such algorithms were adopted as models of human behaviour in chess and other planning problems such as the towers of Hanoi or missionaries and cannibals [126–128]. Error rates, reaction times and self-reports in ‘think-aloud’ planning all appear to suggest the usage of particular heuristics to guide decision tree search.

The direct applicability of these search heuristics to computing expected future values in the RL setting is unclear, however, because the methods we have talked about address what can be seen as a wider class of decision problems than chess or the other planning tasks, involving such additional complications as computing expected returns with respect to stochastic transitions (and potentially rewards) and intermediate rewards along trajectories. Nevertheless, heuristic decision tree pruning has recently arisen as an important feature of human behaviour also in modern RL tasks [129], and other insights from this earlier work are starting to permeate modern notions of hierarchical control [56]. It thus seems a ripe time to revisit these models and results, and attempt to understand how they can be made to relate to the theoretical and experimental phenomena reviewed here.

## Funding statement

Funding was from the Gatsby Charitable Foundation (P.D.) and a Scholar Award from the McDonnell Foundation (N.D.D.).

## Acknowledgements

We are very grateful to Yael Niv and our many other collaborators on the studies cited who have helped us frame these issues, including Ray Dolan, Sam Gershman, Arthur Guez, Quentin Huys, John O'Doherty, Giovanni Pezzulo, David Silver and Dylan Simon. We thank two anonymous reviewers for very helpful comments. The authors are listed in alphabetical order.

## Footnotes

One contribution of 18 to a Theme Issue ‘The principles of goal-directed decision-making: from neural mechanisms to computation and robotics’.

- © 2014 The Author(s) Published by the Royal Society. All rights reserved.