One of the paradoxes of vision is that the world as it appears to us and the image on the retina at any moment are not much like each other. The visual world seems to be extensive and continuous across time. However, the manner in which we sample the visual environment is neither extensive nor continuous. How does the brain reconcile these differences? Here, we consider existing evidence from both static and dynamic viewing paradigms together with the logical requirements of any representational scheme that would be able to support active behaviour. While static scene viewing paradigms favour extensive, but perhaps abstracted, memory representations, dynamic settings suggest sparser and task-selective representation. We suggest that in dynamic settings where movement within extended environments is required to complete a task, visual input, egocentric representations and allocentric representations work together to allow efficient behaviour. The egocentric model serves as a coding scheme in which actions can be planned, but also offers a potential means of providing the perceptual stability that we experience.
1. Introduction
We use our eyes to get the information we need to perform the tasks of everyday life. Some of this information may be in the image on the retina at the time, but usually it is not. If we need to find a mug to make a cup of coffee, that mug is unlikely to be in the central retinal region where we can easily recognize it. It is more likely to be either in peripheral vision where resolution is too poor to identify it, or else it is not in the field of view at all, in which case a reorientation is needed to locate it. This simple example shows that we require, in memory, a representation of the surroundings that is adequate to act as a basis for directing foveal vision to places where it is needed.
In principle, if the brain were able to join up and remember the entire series of images provided by the retinae each time we move our eyes, then we would have a complete panoramic memory that could be used to guide future actions. However, not only is the storage capacity implied by such a proposal unrealistically immense, but there is also a great deal of empirical evidence against it. How, then, is the spatial information required by an active visual system obtained, stored, updated and made available?
In this article, we will first briefly review current thinking about what is and what is not retained each time we move our eyes. This leads directly to the nature of the representations of space that are built up while viewing a scene. Inevitably, much of what has been learned has come from studies of static images where no complex actions are involved, and where the requirements of the representations involved are less exacting than those involving action in three-dimensional space. We then consider tasks that involve the manipulation of objects in proximate space but do not involve bodily relocation. Following this, we discuss the kinds of representation needed to operate in extended spaces that require movement around the environment, for example, within a kitchen while preparing food. Here, the problem is to retain a panoramic memory whose spatial contents are updated as we rotate and translate within the environment—the ‘egocentric model’.
Finally, we return to the question of why the continuous visual world we experience is so different from the temporally and spatially disjointed series of images provided by the retinae. Could it be that our subjective gaze direction is anchored not to the retinal image, but to the continuous representation we use to guide our actions?
2. What is (not) retained across eye movements?
Our subjective experience of continuous visual perception of our surroundings, in spite of the temporally discontinuous and spatially restricted information supplied by the eyes, inevitably poses the question: how does the brain achieve perceptual stability? This question has been asked by researchers since the saccade-and-fixate strategy of the oculomotor system was first observed. Initially, it was assumed that perceptual continuity arose from continued visual sampling during saccades (e.g. [2,3]). However, this notion was quickly and effectively dismissed when Erdmann & Dodge noticed, while using a mirror to observe the eye movements of participants reading text, that they were never able to see their own eyes moving. Thus, perception is suspended during saccades. Emphasis thereafter shifted to trying to understand how the brain might use the information sampled during fixations to construct an internal representation capable of giving rise to our complete and detailed subjective experience. For some time, it was thought that the pictorial contents of each fixation were fused to construct a point-by-point complete picture of the scene within the brain. However, an increasing body of research showed that there was little evidence to suggest that the pictorial content of fixations might be integrated [6–8]. In spite of these arguments against internal veridical representations, the notion of a complete picture in the brain survived until the advent of change-detection studies in the 1990s. A considerable volume of research has now demonstrated that observers can fail to notice large changes to scenes if they are timed to coincide with saccades, blinks or brief artificial interruptions to viewing (figure 1). Change blindness was largely interpreted as providing a strong case against the notion of point-by-point pictorial representations.
If the pictorial content of each fixation were retained, then, it has been argued, it should be trivial to detect changes to the colour, location or even presence of an object. Following this logic, change-detection studies have been used to suggest that the pictorially veridical information from each fixation is lost every time we move our eyes.
Of course, one potential criticism of change-detection studies is that the changes that occur are typically very unlike those we may encounter in a natural context: clothes change colour, buildings move and foliage disappears instantaneously, during some brief interruption to viewing. A number of studies have therefore considered change detection in situations that are more ecologically valid. Extensions of change-detection research into dynamic scenes have suggested that detection is at least as difficult as was found for static scene viewing. Levin & Simons  found that changes to colour, presence, position and identity of objects during cuts between camera viewpoints were rarely detected. Even the main actor in a sequence can be changed during an editorial cut without always being noticed by the observer. Wallis & Bülthoff  extended this work by considering whether there were differences in detection ability depending upon the type of change made in a dynamic setting. These authors created movies of virtual environments simulating observer motion through the environment and compared this situation to static viewing of the same scenes. Wallis and Bülthoff found that the ability to detect the appearance/disappearance of an object was the same in the static and dynamic situations. However, for object orientation, position and, to a lesser degree, colour, detection performance was worse during the simulated observer motion than during static scene viewing. Further support for difficulties in detecting colour changes across cuts in movies has been provided by Angelone et al. . Hirose et al.  have explored object memory across changes in viewpoints in movies and suggested that position information is represented differently to other object properties in such dynamic scenes. This may reflect greater difficulties associated with spatial representations across changes in viewpoint when viewing dynamic scenes, compared with representing other object properties.
In their now-classic study, Simons & Levin  demonstrated that if two actors changed places during a brief interruption to an ongoing conversation (provided by a passing door!), people often failed to notice that they were holding a conversation with a different person from the one who they began talking to. This and the studies above have all been used to argue that much of the visual information must be lost whenever viewing is interrupted.
Further evidence for failure to retain visually rich information in natural settings has been provided by Tatler . This study did not employ a change-detection paradigm, but instead tested what visual information participants could access while engaged in a natural, everyday activity. Participants were interrupted by turning out the lights in a blacked-out room while they were making a cup of tea. When this occurred, they were able to give pictorially rich descriptions of the information that was the target of their foveal vision as the lights went out. However, they were unable to describe what they had been looking at prior to this. The stark contrast in reportability between the final target of fixation and prior targets argues for transient or no retention of visually rich information. However, a common error when attempting to report the final target of fixation at the point of interruption was for participants to mistakenly report the content of their penultimate fixation; this mistaken report was given with the same degree of detail and confidence as the correct reports of final fixation contents. Moreover, the probability that this type of error occurred was related to the time between the start of the final fixation and the time that it was interrupted by the lights going out. The existence of this class of error can only be explained if rich visual information is retained across saccades and for a short time into the new fixation, until it is overwritten by the content of the new fixation (figure 2).
3. Memory for static scenes
The phenomenon of change blindness has renewed interest in the nature of scene representation, and a variety of explanations of how we encode and retain information from the visual environment have been proposed.
(a) Current views of scene memory
First, change blindness has been used to argue that we do not construct an internal representation of our visual environment at all . According to this view, the most reliable source of information about our surroundings is the world itself. Because we have highly mobile eyes, we can redirect our foveae to the regions of the world we wish to scrutinize with relatively little cost and so there is no need to interrogate an internal representation rather than the world itself . The absence of any internal representation raises a number of questions about how our perceptions of the world arise: for example, if we only ever have access to the current retinal image, how do we form three-dimensional perceptions of objects? Here, O'Regan & Noë  appeal to Gibsonian notions of sensorimotor contingencies, whereby perceptions arise from the changes that occur on the retina across eye movements: the changes depend crucially on the three-dimensional structure of objects and so can be used to reveal this structure.
Second, Rensink  proposed that some visual detail can survive saccades, as long as it is the subject of focal attention. Thus, a limited number of proto-objects can be maintained as an object representation, but all unattended visual detail is lost. The attended information can be retained as a coherent object representation only for as long as it receives focal attention. Rensink  further proposed that this limited attentional coherence of visual detail was integrated with more abstract and higher level information about the gist and layout of scenes (figure 3). In this view, therefore, information survives beyond the end of a fixation, but only if and while attended. The number of items that can be preserved beyond the end of a single fixation is also very limited.
Third, a number of authors have suggested that visual representations may be less sparse than suggested by Rensink. Indeed, one could argue that change blindness need not imply that representation must be sparse or absent, and that failure to detect change may have a number of possible causes. For example, representations may be formed that are not accessible to conscious scrutiny and therefore cannot be used to report the change. Support for this possibility comes from studies that have shown better than chance localization of a change in a stimulus array, even when the participants report that they are unaware of any changes. Another possibility is that while point-to-point visual detail may be lost from each fixation, other, more abstract information may be retained, but may be insufficient for supporting change detection. Both of these possibilities suggest that representations are formed that survive fixations, a notion supported by a growing number of studies demonstrating that object property information appears to be extracted and retained from the scenes viewed (e.g. [25–28]). However, while there is general agreement that information survives the fixation, the nature of the retained information remains the topic of continued research and debate. One possibility is that object information may be encoded into a limited number of object files [29–31]. These object files are temporary representations of objects maintained across several saccades. However, this representation scheme remains quite sparse, with an upper limit of three to five object files that can be maintained at any time. Once the upper limit is reached, new object files can only be encoded and retained at the expense of existing files.
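The capacity constraint described for object files can be made concrete with a minimal sketch. It is an illustration only, not a model proposed in the literature cited here: the class name `ObjectFileStore`, the default capacity of four (an arbitrary value within the three-to-five range) and, in particular, the rule that the least recently refreshed file is the one displaced are all assumptions. The text states only that new files are encoded at the expense of existing ones, not which file is lost.

```python
from collections import OrderedDict
from typing import Dict, Optional


class ObjectFileStore:
    """Capacity-limited store of temporary object representations (illustrative)."""

    def __init__(self, capacity: int = 4) -> None:
        # Assumption: 4 is an arbitrary value within the three-to-five range.
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._files: "OrderedDict[str, Dict[str, str]]" = OrderedDict()

    def encode(self, object_id: str, properties: Dict[str, str]) -> Optional[str]:
        """Open or refresh a file; return the id of any file displaced to make room."""
        if object_id in self._files:
            # Refixating a known object refreshes its file rather than opening a new one.
            self._files[object_id].update(properties)
            self._files.move_to_end(object_id)
            return None
        displaced = None
        if len(self._files) >= self.capacity:
            # Assumption: the least recently refreshed file is the one sacrificed.
            displaced, _ = self._files.popitem(last=False)
        self._files[object_id] = dict(properties)
        return displaced

    def recall(self, object_id: str) -> Optional[Dict[str, str]]:
        """Return the stored properties, or None if the file no longer exists."""
        return self._files.get(object_id)


# Usage: with room for three files, encoding a fourth object displaces one.
store = ObjectFileStore(capacity=3)
for name in ("mug", "kettle", "spoon"):
    store.encode(name, {"colour": "white"})
lost = store.encode("teapot", {"colour": "blue"})
```

Under these assumptions, a change to the displaced object would go undetected because nothing remains to compare it against, whereas a change to an object whose file survives could in principle be noticed.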
In contrast to the sparse scheme suggested by the object file account, some authors have suggested that richly detailed visual representations are formed when viewing scenes (e.g. [32–34]); these representations contain a large amount of detailed visual information and can survive for extended periods of time. Hollingworth argues that because observers can detect changes to objects that are as small as a change in orientation, the representations that underlie this detection must be visually rich in order to support such subtle distinctions. A similar case for high-capacity, visually rich memory for objects has been proposed by Brady et al. After viewing 2500 objects for just 3 s each, observers were able to discriminate previously seen objects from paired distractors with impressive ability. This was even the case when distractors were different exemplars from the same object category or were the same object at a different orientation. The authors argue that such fine discriminations for objects drawn from such a large memorized set imply that object memory must be both rich in visually precise detail and immense in capacity.
While Hollingworth and Brady et al. argue for visually rich representations, other authors have interpreted essentially similar findings in different ways. Melcher [27,36] proposed the involvement of a higher level medium-term memory, with representations being less strictly visual and more abstracted. Tatler has also favoured a more abstract account of representation [28,37,38], but one that can still include a large amount of information describing the objects in the scene. Recently, Pertzov et al. have argued for a similar scheme of representation to that discussed by Tatler. It is hard to find empirical evidence that clearly favours any particular one of the interpretations suggested by Hollingworth, Melcher and Tatler, and the nature of information retained from fixations remains an open question.
Common to all of these accounts is the finding that information survives beyond a single fixation and typically accumulates over prolonged viewing. While Hollingworth, Melcher and Pertzov have argued for a general increase in accumulated information over time, Tatler has suggested that different object properties are integrated into representations over different time scales. In general, during viewing, the overall scene gist and spatial layout seem to be extracted earlier than more detailed information about the properties of individual objects [28,40]. When multiple object properties were tested at the same time, Tatler et al.  found that object identity and colour did not accumulate over multiple fixations of an object, with maximal performance in response to questions testing these properties being reached after only a single fixation of the object. Conversely, position information continued to accumulate over multiple refixations of the object.
(b) The interplay between vision and memory for static scene viewing
If representations of the visual scenes we observe are formed, it is reasonable to assume that they might influence ongoing inspection behaviour. Certainly, there is evidence that saccade programming can involve not only immediate visual input but also remembered information. When viewing a series of isolated fixation targets, saccades can be launched to remembered locations of previously presented targets . For more complex scenes, a brief preview of the scene has been shown to facilitate subsequent search for a target object . Using a gaze-contingent moving window paradigm, Castelhano and Henderson showed that search time was faster following a whole-scene preview than following no preview or a preview of a different scene. This result suggests that scene information encoded during the preview period played a role in programming the saccades launched during the search epoch. Information encoded from complex scenes can also influence inspection behaviour over much longer time scales. Repeated viewings of scenes decrease search times, even when repetitions of the scene are separated by several intervening trials .
Oliva et al.  considered the interplay between vision and memory by presenting scenes that extended beyond the bounds of what was visible on the monitor at any one time. Panoramic virtual scenes were presented by panning a virtual camera across an extended scene such that the observer was presented with a moving image on the monitor. Scenes were shown twice: once to learn a set of objects present in the panorama, and subsequently to decide whether each of the learned objects was present or absent. The nature of the responses in the test phase was varied such that visual information at test and memory from training were differentially informative. Participants forced to rely on either visual or remembered information alone were able to complete the search task. However, when both sources of information were present, search behaviour was dominated by the immediate visual information. Taken together, these results argue that remembered information can influence ongoing gaze behaviour, but that for viewing static scenes gaze relocations are primarily under the control of immediate visual input.
(c) Frames of reference for programming saccades
However rich or sparse the information accumulated across saccades, the question arises as to the form in which it is stored, and in particular whether the representations are compensated for the changes in eye direction that result from each saccade. Following a saccade, an object that was in one location on the retina, or on any retinotopic representation in the brain, will now be somewhere else. This means that, if a number of saccades intervene between seeing an object and returning to it, a straightforward representation of the object's original location in retinotopic coordinates will not provide the right vector to allow a return saccade to be made.
There are three ways round this problem. The memory representation might be kept in retinotopic coordinates, but with each intervening eye movement monitored, perhaps by efference copy signals, and summed vectorially so that when a return saccade is made, it is compensated for the intervening path of the eye. Alternatively, the representation could be stored in head-based coordinates, with object location stored as the sum of retinal location and proprioceptively monitored eye-in-head position. Retrieval in such a scheme simply involves making a saccade based on the difference between current gaze direction and object location, both in head-based coordinates. Thirdly, objects can be located with reference to exocentric cues; that is, to the positions of other objects in a scene. This requires an indexing of the identities and locations of scene landmarks in a quasi-pictorial representation that is not necessarily tied to any one physical frame of reference. Coding of remembered visual information in exocentric coordinate frames, providing a scaffold for immediate visual input, has been suggested on several other occasions [27,28,45]. Karn et al. favour a combination of head-centred and exocentric reference frames. Others favour a spatial updating scheme based on retinal coordinates, but with mechanisms in the parietal cortex for translating this representation into other, head- or body-centred, coordinate frames [46,47]. This possibility of transforming between multiple coordinate frames has been the subject of much research and is particularly important in the context of visuomotor tasks in extended environments; we will return to this issue in §5. As we shall see there, during active tasks it is also necessary to assume that there are representations of object locations that include parts of the surroundings that are outside the current visual field.
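The three schemes can be compared in a short worked sketch showing that, in an idealized case, each yields the same return saccade. The simplifications are ours and not part of the accounts cited: directions are two-dimensional angles in degrees that add linearly, the head and body are stationary, and the monitoring signals are noise-free. All function names are hypothetical.

```python
from typing import Iterable, Tuple

Vec = Tuple[float, float]  # (horizontal, vertical) direction in degrees


def add(a: Vec, b: Vec) -> Vec:
    return (a[0] + b[0], a[1] + b[1])


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1])


def retinotopic_with_updating(target_on_retina: Vec,
                              intervening_saccades: Iterable[Vec]) -> Vec:
    """Scheme 1: keep the retinal location and subtract the vector sum
    of the intervening saccades (e.g. from efference copy signals)."""
    total: Vec = (0.0, 0.0)
    for saccade in intervening_saccades:
        total = add(total, saccade)
    return sub(target_on_retina, total)


def head_centred(target_on_retina: Vec, eye_in_head_at_encoding: Vec,
                 eye_in_head_now: Vec) -> Vec:
    """Scheme 2: store retinal location plus eye-in-head position, then
    saccade by the difference between stored location and current gaze."""
    target_in_head = add(target_on_retina, eye_in_head_at_encoding)
    return sub(target_in_head, eye_in_head_now)


def exocentric(target_offset_from_landmark: Vec,
               landmark_on_retina_now: Vec) -> Vec:
    """Scheme 3: store the target relative to a landmark; this requires
    the landmark to be visible (or otherwise located) at retrieval."""
    return add(landmark_on_retina_now, target_offset_from_landmark)


# Worked example: the target is seen 10 deg right and 5 deg up of the fovea
# with the eye straight ahead; two saccades then intervene.
target_on_retina = (10.0, 5.0)
saccades = [(3.0, -2.0), (-8.0, 4.0)]
eye_now = (-5.0, 2.0)                          # vector sum of the two saccades
landmark_in_head = (2.0, 1.0)                  # hypothetical scene landmark
offset = sub(target_on_retina, landmark_in_head)   # encoded with eye at (0, 0)
landmark_on_retina_now = sub(landmark_in_head, eye_now)

v1 = retinotopic_with_updating(target_on_retina, saccades)
v2 = head_centred(target_on_retina, (0.0, 0.0), eye_now)
v3 = exocentric(offset, landmark_on_retina_now)
```

In this idealized case the three vectors are identical. The schemes differ in what must be monitored or stored (a running sum of eye movements, eye-in-head position, or landmark identities and locations), and therefore in how errors would accumulate once noise or head and body movement is introduced.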
The theoretical perspectives developed by the various authors in the sections above have, in general, been derived from experimental paradigms either mostly or wholly within the realm of static scene viewing. It is reasonable to ask whether the same representational structures and processes that have been described for static scene viewing would be found in more natural, dynamic settings. Understanding representation in the context of natural settings is important because the role of internal representation must surely be to assist us in reaching our behavioural goals. We are certainly not the first to raise concerns about the use of static scene viewing paradigms for eye movement research, and the need to consider more dynamic settings. For example, Henderson and colleagues have raised this concern on a number of occasions [48–50]. Hayhoe has also argued the need to study representation in the context of natural tasks and has suggested that the representational processes under such circumstances may be very different from those under static scene viewing conditions.
In the sections that follow, we will consider the questions of visual representation and memory in the context of tasks carried out wholly in proximal space and those that require movement through a larger environment. These two situations place potentially differing requirements on any representational system.
4. Memory during manipulations in proximate space
An important aspect of natural visual environments is that we tend to interact with the scene rather than simply observe it. Therefore, any representational scheme that supports natural behaviour must be flexible enough to deal with how our actions influence the world. Under these circumstances, enduring memories such as those that have been suggested from the static viewing paradigms discussed above may not be useful and indeed may even interfere with efficient interaction with the world. For example, we do not want an enduring memory of the previous location of an object once we have moved it. It may therefore be that being involved in an active manipulation of the environment places different demands upon the representational processes and structures that underlie vision.
(a) Evidence for limited moment-to-moment memory
Ballard, Hayhoe and colleagues have used virtual reality tasks in which participants interact with objects in proximal space, in order to study eye movements and representation in the context of an active task. In a task in which participants used coloured blocks to reconstruct a visible model, the eye movement strategies employed revealed a tight coupling between vision and action . In this task, each goal completion requires knowledge only of the colour of each block and the position in the model at which it should be placed. Despite such limited demands on memory, participants typically looked twice at the model that they had to copy during each cycle of selecting and placing a block: once before selecting the next block, then again before placing it in the construction area (figure 4). This result was interpreted by the authors as suggesting that the two fixations served very different purposes: the first to encode the colour of the next required block, the second to extract the information about where to place the block. Such limited information in the context of this simple naturalistic task is far more consistent with the views of scene representation expressed by O'Regan and Rensink than it is with the more extensive representational schemes discussed by Hollingworth, Melcher and Tatler. Later work from the same authors, however, has shown that memory during the block-copying task may not be as limited as they initially suggested . At the start of each model-building trial, people tended to fixate the relevant block in the model area twice, as described above. However, over the course of the trial, there was a shift towards using a strategy in which the model was not returned to. Instead, after adding a block to the model, the eyes moved straight to the resource area to select the next block, and from there to the construction area to guide the placement of this new block. 
This latter strategy is only possible if details have been remembered from previous fixations of the model. The gradual shift towards this memory-based strategy later in the trial implies some degree of information accumulation over time.
While Ballard's block-copying paradigm may not be an ideal surrogate for understanding the nature of representations that might underlie natural behaviour, it does point to the possibility that information is only encoded when it is required for the immediate task goals. The notion of only gathering and retaining information at the times when that information is required for the current behavioural goal has been explored and extended by Hayhoe and colleagues, using more semantically distinguishable objects and environments. Triesch et al. used a virtual block-sorting task to consider the influence of introducing changes at critical times during the execution of the behaviour. In this task, blocks of two different heights were sorted by placing them on one of two conveyor belts (figure 5). Three conditions were used to vary the relevance of height information at various points in the task. In the first condition, participants were asked to pick up bricks from front to back in the virtual space and place all bricks on the nearest conveyor belt. Thus, brick height was relevant to neither the pick-up nor the put-down decision. In the second condition, participants were asked to pick up all the taller bricks first and place each on the front conveyor, and then to pick up the shorter bricks and also place them on the nearest conveyor belt. Thus, in condition 2, brick height was relevant to the pick-up but not to the put-down sections of the task. In the third condition, participants were asked to place all the taller bricks on the closer conveyor belt and then to place all the shorter bricks on the far conveyor belt. Thus, in condition 3, brick height was relevant to both the pick-up and put-down decisions. In 10 per cent of all trials, the height of the brick was changed between pick-up and put-down (i.e. while in the participant's hand).
Detection of these changes increased as the height of the brick became more relevant throughout the task: 2 per cent of changes were detected in condition 1, 20 per cent in condition 2 and 45 per cent in condition 3. This result argues elegantly that whether information about the height of the brick was retained stably throughout the task depended critically on whether and when the height was relevant to the task.
The importance of task relevance through time in representations for visuomotor tasks was explored further by Droll & Hayhoe, who varied how predictable it was that a cue would be relevant later in the block-sorting task. In this paradigm, blocks were defined by four properties: height, width, colour and texture (figure 5). Visual cues were presented both for the pick-up and put-down decisions. These cues indicated not only which feature to use for the two decisions, but also, in the case of the put-down cue, how to use this feature to sort the blocks between two virtual conveyor belts in front of the participant. In the most predictable condition, the same single feature was used for both the pick-up and put-down decisions. Under these circumstances, refixations of the block once it had been picked up were rare, indicating a reliance on the remembered state of the feature when making the decision about which conveyor belt to place it on. These authors also used an unpredictable condition in which a single cue was used to pick up blocks, but any one of the four features could be selected at random for the put-down cue. In this condition, refixations of the block after it had been picked up, but before it was placed on a conveyor belt, were common. This was true even when the put-down cue was the same as the pick-up cue (which occurred in 25 per cent of trials). This result implies that if it can be predicted that information will be required at a later stage of the task, it can be retained stably until needed. However, if it is not predictable that the information will be needed again, that property is not retained and the eyes are used to gather information as and when it is needed for the task.
At this stage, we should compare this seemingly limited and selective scheme of representation derived from visuomotor tasks performed in proximate space with the more comprehensive and detailed representational schemes described in the context of static scene viewing. Certainly, when viewing static scenes (even static real environments in the case of ), there is a large volume of evidence to suggest that much object information can be retained stably throughout viewing and recalled when tested after the trial [27,28,56]. Not only can apparently detailed representations be found for static scene viewing paradigms, but also there is compelling evidence that information is encoded incidentally, and does not require that objects be the target of active memorization . Castelhano & Henderson compared visual memory for a memorization task and a visual search task. These authors found that memory performance was still good for objects in the search task, where there had been no expectation that the object information would be required later as the memory test was unexpected. All of these results from static scene viewing paradigms are very different from those discussed above for visuomotor tasks. Why should such a selective and task-dependent representation be found in active settings, when it is possible to encode and remember much more? A number of possible explanations can be suggested here. First, it may simply be that when engaged in a visuomotor task we employ a principle of efficiency, expending resources only upon maintaining representations of information that are necessary and only for the times at which they are required. Second, it may be that the dynamic nature of the scene places constraints upon representations that are not present when viewing a static scene. For example, memory for object positions in a dynamic setting presents a very different set of problems from memory for position in a static scene. 
In the latter, a single index of the target with respect to the scene will suffice. However, for a dynamic situation, position information must encompass movements by the observer, movements of elements in the environment and changes in object locations as a result of active manipulations by the observer. These additional demands of encoding information in a dynamic setting may require additional or even alternative representational schemes to those employed when encoding static pictures.
One possible limitation of the dynamic tasks that have been discussed so far is that the apparent sparsity of representation may be confounded by the semantic (and visual) similarities of the objects used in the paradigms. Coloured or textured blocks used on repeated trials may result not only in difficulties for maintaining distinct representations of the component objects in a task, but may also result in interference between trials. Such an interference effect for semantically similar displays has been found in the context of static scene viewing . It will therefore be important in the sections that follow to draw widely from active task settings to include more natural tasks with semantically distinguishable objects.
(b) Evidence for temporally extended spatial memory
The tasks discussed in §4a suggest a scheme of representation in which much of the available information is encoded only when needed and retained only if required later in the task. However, similar tasks carried out in proximate space have provided evidence for a slightly different form of temporal extension to representations, involving pre-emptive information-gathering. When washing one's hands  or making a sandwich , fixations are occasionally made to objects that are not involved in the current part of the task but will be the focus of an upcoming act. These ‘look-ahead’ fixations may occur several seconds before that object is used but have a measurable benefit for the efficiency with which the target is later re-acquired .
In order for these look-ahead fixations to have a measurable behavioural consequence, some information gathered during these fixations must persist long enough for it to aid the later location of the object. Further evidence for the functional significance of looking ahead comes from the observations that objects that were the focus of completed portions of the task are never looked back to . Thus, the look-ahead fixations are not incidental looks, but are likely to be functional. For natural task settings, it is hard to estimate exactly what information about an object might be extracted during these look-ahead fixations. However, this must minimally be some spatial information about where the object is, in order to produce the observed differences in how quickly and how accurately the object is fixated later in the task when it is the target of the current act. These studies show that information-gathering can be proactive, seeking out information that will be needed in the near future and retaining this information until it is required.
(c) The balance between vision and memory in peripheral vision
In §3b, we discussed studies which demonstrated that saccades can be programmed on the basis of immediate visual input or memory depending upon the relative availability of these two sources of information. Here, we consider the relative roles of vision and memory when targets of visuomotor tasks are present within peripheral vision. It is usually assumed that we use vision to locate objects that are within our peripheral field of view, but there is evidence that this is not always the case, and that memory may be equally important even for objects that are plainly visible. The resolution of peripheral vision is poor, and the angle that an object must subtend to be identified increases dramatically and approximately linearly with its angular distance from the fovea . Very approximately, if a letter 0.1° high can be identified in the fovea, it needs to be 1° high at an eccentricity of 10°, and 6° at 60°. Aivar et al.  used a variant of Ballard's block-copying paradigm to consider whether saccades to blocks in peripheral vision are guided primarily by vision or memory. In this task, the layout of the blocks in the resource area (figure 4) was changed when the participants looked away from this area. Provided the participants had sufficient time to familiarize themselves with the layout of the environment before the first change was made, saccades to the resource area were launched to the remembered locations of blocks rather than to the actual post-change locations. This is in spite of the fact that the resource area was within peripheral vision at the time that these saccades were launched. Aivar's result clearly implicates an important role for memory in planning saccades to targets in peripheral vision.
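The approximate scaling of identifiable letter size with eccentricity quoted in this paragraph can be sketched numerically. The Python fragment below is only an illustration: the function name, the linear form and the slope of 0.1° of letter height per degree of eccentricity are assumptions chosen to reproduce the three quoted values roughly, and are not fitted to any data set.

```python
def min_identifiable_size(eccentricity_deg, foveal_size_deg=0.1, slope_per_deg=0.1):
    """Rough minimum letter height (deg) that can be identified at a given eccentricity.

    Assumed linear rule: size = foveal_size + slope * eccentricity.
    Both default parameters are illustrative assumptions, not measured values.
    """
    if eccentricity_deg < 0:
        raise ValueError("eccentricity must be non-negative")
    return foveal_size_deg + slope_per_deg * eccentricity_deg


# Print the required letter height at the three eccentricities mentioned in the text.
for ecc in (0, 10, 60):
    print(f"{ecc:>2} deg eccentricity -> about {min_identifiable_size(ecc):.1f} deg letter height")
```

Under these assumed parameters the rule gives values close to the 0.1°, 1° and 6° quoted above, which is all that a "very approximate" linear description requires.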
Brouwer & Knill  used a virtual visually guided reaching task to consider the relative use of vision and memory for guiding action. In this task, two virtual objects had to be picked up and placed in a trash bin. In some trials, the position of the second target was moved by a small amount while the first was being moved to the trash. While participants never noticed this, it did have a noticeable influence on their behaviour. Essentially, this perturbation means that vision and memory were in conflict when the arm movement towards the second target was executed. Brouwer and Knill found that both vision and memory played a role in the targeting decision, but the relative weighting of vision and memory in planning the reach to the second target depended critically on the visibility of the second target. Targeting high-contrast objects involved a greater reliance on visual information than did targeting low-contrast objects, where remembered position was relied upon more. From this, it seems that the targeting system uses a blend of what is in immediate vision, and what is available from the current representation of the surroundings.
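One way to picture such a blend of vision and memory is as a reliability-weighted average of the two position estimates. The sketch below uses inverse-variance weighting, a standard cue-combination scheme that we assume here purely for illustration; it is not presented as the analysis used in the study described above, and the function name and all numerical values are hypothetical.

```python
def combine_position_estimates(visual_pos, memory_pos, visual_sd, memory_sd):
    """Blend a seen and a remembered target position (assumed inverse-variance weighting).

    visual_sd and memory_sd are the assumed uncertainties (standard deviations)
    of each estimate; a less reliable source receives a smaller weight.
    Returns (combined_position, weight_given_to_vision).
    """
    if visual_sd <= 0 or memory_sd <= 0:
        raise ValueError("standard deviations must be positive")
    visual_reliability = 1.0 / visual_sd ** 2
    memory_reliability = 1.0 / memory_sd ** 2
    w_vision = visual_reliability / (visual_reliability + memory_reliability)
    combined = w_vision * visual_pos + (1.0 - w_vision) * memory_pos
    return combined, w_vision


# Hypothetical case: the target was shifted by 1 cm while it was not being fixated,
# so vision reports 1.0 and memory still reports 0.0.
high_contrast = combine_position_estimates(1.0, 0.0, visual_sd=0.5, memory_sd=1.0)
low_contrast = combine_position_estimates(1.0, 0.0, visual_sd=2.0, memory_sd=1.0)
print("high contrast (position, vision weight):", high_contrast)
print("low contrast  (position, vision weight):", low_contrast)
```

With these made-up uncertainties, the high-contrast target is dominated by vision and the low-contrast target by memory, which mirrors the qualitative pattern described in the text.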
5. Spatial memory during active tasks requiring movement
In the above section, we have argued that when engaged in an active task performed within proximate space, the representational scheme that underlies such behaviour appears to be rather limited. However, the demands placed upon memory and representation in such settings are in some ways rather reduced compared with the potential requirements for representation in situations where task completion involves movement through an extended environment. For example, when preparing food in a kitchen we need to know not only about the work surface we are facing, but also about those on either side or behind us. One central problem that any representational scheme must solve in a real-world setting, which is not present in static scenes, is the spatial reference frame in which we must represent our surroundings.
(a) Spatial organization in natural settings
When comparing spatial organization in pictures and the real world, two issues are immediately apparent. First, the scales at which spatial information occurs are very different. Second, pictures cannot readily be used to distinguish between the range of frames of reference in which space may be coded in natural settings.
Montello  has suggested that we can classify space into four different categories. Figural spaces are smaller than the body and include objects and pictures. Vista spaces are larger than this but only encompass what can be seen by an observer from a single viewpoint. Environmental spaces go beyond what a single observer can see from a single vantage point, but are bounded by what a human can reasonably explore on foot. Geographical space is that beyond the exploration capabilities of a single individual. In natural tasks, at least figural and vista space will be important, but in many cases, understanding our surroundings in environmental space is also important (such as understanding where different rooms are in our house, or shops in other parts of town). Pictures are therefore problematic in two ways. First, they cannot encompass the larger scales of spatial information that we need to understand in natural settings. Second, they compress vista space into figural space: a whole vista is presented within the bounds of a picture, which itself occupies figural space.
Given the different levels at which space can be represented, one question that arises is whether we have entirely separate representations for each of these levels, or whether there is cross-talk between them. That is, does the representation of our current vista (e.g. a room) interact with our representation of the environmental space outside the room? Hirtle & Jonides  used a variety of methods to test participants' recall for a real extended environment, and concluded that levels of representation were nested hierarchically. In contrast, Brockmole & Wang [66,67] found no evidence for cross-talk between vista representation and the representation of environmental space.
Insights into the frames of reference in which we might encode and retain information about scenes can be made by considering the effect of changes in viewpoint when viewing static or dynamic scenes. Simons & Wang  used an array of objects on a tabletop to explore the influences of changing viewing position (by walking between two viewing positions) and retinal projection (by rotating the tabletop) upon change detection. Changing the retinal projection between views by rotating the table resulted in poorer change detection. However, changing the retinal projection by the same amount by asking the observer to walk to a new viewing location did not have any detrimental effect on change detection. The importance of generating the movement between viewpoints was demonstrated by sitting participants in a wheeled chair and wheeling them (with eyes closed) to the new location. This manipulation resulted in poorer change-detection performance. Wang & Simons  conducted a series of follow-up experiments to reinforce the suggestion that viewers can update representations across active changes in viewpoint, but not across passive changes in viewpoint. For dynamic movie sequences, the ability to encode information across viewpoints is unclear. Garsoffky et al.  found recognition accuracy to be higher when scene memory was tested using the same viewpoint as experienced by the viewer when watching a movie sequence than when the viewpoint at test did not match that at encoding. This result is consistent with a viewpoint-dependent representation. However, Garsoffky et al.  showed no such cost of viewpoint change when recognizing computer-animated basketball scenes, consistent with a viewpoint-independent representation. While the evidence suggests that active exploration of the world is essential for being able to integrate information across viewpoints, the coding scheme in which the information is represented remains unclear from these studies. 
However, the importance of active exploration and the ability to use changes in viewpoint both imply that spatial coding of objects can occur in coordinate frames that are neither retinocentric nor wholly exocentric.
(b) Transformations between frames of reference
The coordinate frame in which space may be represented in the brain has been the topic of much research [46,72–74]. It is clear that muscular movement plans must ultimately be coded in limb-centred coordinates. Similarly, visual information must initially be coded in retinotopic space. Indeed, it is clear that in the context of natural behaviour, a range of different spatial coding schemes must be involved and must act in parallel (figure 6). However, it seems likely that efficient coordination of multi-sensory input and motor output must involve transformation between the various parallel frames of reference for spatial coding. Converging evidence suggests that such transformations are possible and that the parietal cortex is crucially implicated in multimodal spatial organization.
One question is whether the parietal cortex simply handles the transformations between multiple frames of reference or combines across representations to form a master representation of space. Chang et al.  found evidence for parietal transformation between eye- and hand-centred representations, consistent with a single representation of eye–hand distance. Whether the parietal cortex forms a master map or simply handles the transformation between representational frames of reference, it is clear that efficient behaviour relies upon the integration of information coded in a range of different frames of reference. It should also be noted that frames of reference for sensory processing and motor responses are all in some way centred on the individual rather than in exocentric coordinates.
(c) Spatial memory in natural tasks
In natural tasks, we often find gaze relocations to objects that are currently outside our field of view. In a study of tea making, Land and colleagues found that gaze changes of up to 180° were often made to objects on other surfaces in the kitchen . Sometimes these were made with a series of saccades, but frequently they were completed with a single saccade that involved combined rotations of the eyes, head and trunk. Importantly, the movement of the gaze was continuous until the target was reached (figure 7).
Most of these gaze shifts were accompanied by a long blink, so that vision would have been impossible for most of the movement. This means that the complete gaze movement must have been pre-planned. Typically, these gaze saccades were off-target by about 10°, and were followed 200–300 ms later by a second small saccade that brought the fovea onto the target (figure 8). All this suggests that the system that allows gaze to target unseen objects has access to the same transient egocentric representation of the surroundings that makes it possible to locate, and point to, objects in the world around us when they are not currently visible. The resolution of this representation is not good enough to allow exact targeting, but it seems that it is sufficient to bring the foveae close enough to the target to allow a second saccade to be made under visual control.
(d) Allocentric maps and egocentric models
Recent accounts of the way we encode information about objects, places and routes in the world around us propose that we have two kinds of spatial representation: allocentric and egocentric (e.g. [73,77]). While we highlighted the established notion that there are a variety of egocentric coding schemes in §5b, our primary concern here is to argue for the utility of egocentric coding for spatial representation in natural tasks rather than to determine which of the possible ego centres is at the heart of this coding scheme.
The allocentric representation is map-like (figure 9a). It is indexed to a world-based coordinate system, is independent of our current location and heading and survives over extended periods of time. This representation must of course be built up from vision over time, but does not rely on immediate visual input. Longer term memories of the present or similar environments are integrated into this representation.
When walking in a natural environment, there is evidence for the storage of information about objects encountered on the route, which is consistent with an updating of a world-based allocentric representation. Droll & Eckstein  asked participants to walk a course around a building eight times. A variety of objects were arranged close to the path that the participants walked. While the participants walked this course, changes were made to nine of the objects located near the path. These changes were made to objects while they were out of sight for the participants. When simply instructed to walk the route, participants were very unlikely to detect these object changes (5% detection). However, when asked to prepare for an object memory test that would follow the experiment, participants were far more likely to detect the change (32% detected) and also spent longer looking at individual objects. This result is consistent with the notion of encoding information in world-centred coordinates and also suggests that, like the studies discussed in §4, such representations are selective and task-based.
The other kind of spatial representation, the egocentric representation, is temporary, and based on the directions of objects relative to our current body position and heading in the space around us (figure 9b). It is this second representational frame that allows us to act upon our environment, for the purposes of locating, reaching for and manipulating objects. We can think of the egocentric model as containing low-resolution information about the identities and locations of objects throughout the 360° space around us. It is available for making targeted movements of gaze or arm irrespective of whether or not it is supplemented by direct visual information. Of course, the view we see is not the same thing as the egocentric model. The seen world has detail, colour and movement, none of which is an obvious property of the model. We are conscious of what is in the field of view, particularly the central region around the fovea, to a much greater degree than we are of objects outside the visible part of our surroundings. Nevertheless, in familiar places, we are aware of what is outside the current field of view and are able to point to or make saccades to unseen objects with reasonable accuracy (figure 8). The egocentric model can be updated from the allocentric map by a process akin to map-reading: finding one's location on the map and matching one's current heading to it. It can also be refreshed by direct visual input, adding or correcting the locations of particular features.
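The map-reading step described here, in which one's location and heading on the allocentric map are used to derive egocentric directions, can be written as a simple coordinate transformation. The following Python sketch is illustrative only. The function name and the conventions (map x pointing east, y pointing north, heading measured clockwise from north, bearings positive to the right) are assumptions made for this example and carry no claim about how the brain implements the computation.

```python
import math


def allocentric_to_egocentric(object_xy, observer_xy, heading_deg):
    """Convert a map (allocentric) location into an egocentric bearing and distance.

    Assumed conventions: x = east, y = north; heading_deg is measured clockwise
    from north; the returned bearing is in [-180, 180), positive to the right.
    """
    dx = object_xy[0] - observer_xy[0]
    dy = object_xy[1] - observer_xy[1]
    distance = math.hypot(dx, dy)
    # Direction of the object on the map, clockwise from north.
    world_bearing = math.degrees(math.atan2(dx, dy))
    # Subtract the observer's heading and wrap into [-180, 180).
    relative_bearing = (world_bearing - heading_deg + 180.0) % 360.0 - 180.0
    return relative_bearing, distance


# Hypothetical kitchen: observer at the origin facing east, kettle 2 m to the north.
bearing, distance = allocentric_to_egocentric((0.0, 2.0), (0.0, 0.0), heading_deg=90.0)
print(f"kettle bearing {bearing:.0f} deg, distance {distance:.1f} m")
```

A negative bearing in this convention means that the object lies to the observer's left, so the same map entry yields different egocentric directions as position and heading change.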
Although authors differ in the emphasis placed on each kind of representation in this dual scheme [73,74], the idea of a combination of an enduring and a temporary store generally accords well with people's intuition of how they operate in space. There is now a great deal of evidence for the existence of both kinds of representations in the brain, with the allocentric map located in the hippocampus and the medial temporal lobe, the egocentric model in the parietal lobe and translations from one to the other occurring in the retrosplenial cortex . A number of lines of evidence favour the precuneus on the medial face of the parietal lobe as a likely location of the egocentric model (e.g. ).
To be of continuing use, the egocentric model must always be oriented so that it is aligned with the current field of view. Thus, as we move through our environment, the model must be constantly updated and rotated to match our body rotations. Rotations of the body must be accompanied by corresponding rotations of the egocentric model in order for this model to serve the planning and execution of the next motor command. If gaze is rotated 110° clockwise, the egocentric model must rotate 110° anti-clockwise (figure 10).
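The counter-rotation described in this paragraph amounts to subtracting the gaze rotation from every stored bearing. The minimal Python sketch below treats the egocentric model as a table of object bearings; the dictionary representation, the function name and the sign convention (clockwise positive) are assumptions for illustration, and the sketch leaves open whether the ego-centre is gaze, head or body.

```python
def rotate_egocentric_model(model, gaze_rotation_deg):
    """Update stored bearings after a gaze rotation (clockwise positive, assumed).

    model maps object names to bearings in degrees relative to the current
    direction of regard. Rotating gaze clockwise by R degrees shifts every
    stored bearing by -R, i.e. the model rotates anti-clockwise by R.
    Bearings are wrapped into [-180, 180).
    """
    return {
        name: (bearing - gaze_rotation_deg + 180.0) % 360.0 - 180.0
        for name, bearing in model.items()
    }


# Hypothetical layout: kettle straight ahead, mug 110 deg to the right.
model = {"kettle": 0.0, "mug": 110.0}
updated = rotate_egocentric_model(model, gaze_rotation_deg=110.0)
print(updated)
```

After the 110° clockwise gaze shift in this example, the mug is stored as lying straight ahead and the kettle as lying 110° to the left, so the next movement can be planned from the updated bearings.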
Although the prospect of a model of the world rotating in the brain seems alarming, there is a precedent. Duhamel et al.  found that the cells in the lateral intraparietal (LIP) area ‘remap’ the locations of stimuli when the eyes move. The receptive fields of the whole array of LIP neurons shift in such a way that the new target becomes the centre of the array about 80 ms before the saccade begins. We are proposing a similar ‘software’ transformation here for the egocentric model.
In order to consider the consequences of head and body rotation for updating the egocentric model, it is worth returning to the issue of what should be the centre of the egocentric model we describe. This model could be centred around gaze, the head or the body, and the consequences of rotations of each of these components would be different when updating the egocentric map depending upon which forms the ego-centre. For example, if the head were rotated without body rotation, the consequences for the egocentric model would be very different if it is coded in head- or body-centred coordinates. For the purposes of the present discussion, we wish to remain somewhat agnostic about the frame of reference for the centre of the egocentric model. This is in part because, for much of the time that we are involved in natural tasks, there is a co-alignment of at least the head and the body, and often of gaze too: we tend to move such that we bring eyes, head and body in line with the target of the current manipulation and it is only in the transitions between each manipulation that these three components become unaligned. The consequence of this is that when the plan is made to move on to the next object, it is usually planned and initiated at a time when gaze, head and body are all in close alignment. Even if this were not so, the ability of the parietal cortex to translate between different egocentric reference frames, as emphasized by Colby & Goldberg , makes it difficult to design a test that would distinguish between them. The fact that the egocentric model must itself rotate during movement (figure 10), and that the head has its own rotation sensor in the vestibular system, perhaps argues in favour of a primary role for a head-centred representation.
To illustrate the way vision and the egocentric map interact, let us consider the example of intending to pick up a mug, situated somewhere within or outside the range of peripheral vision (target T in figure 11). Its location can be obtained, at least roughly, from the egocentric model. But to grasp the mug, more details are required. Before picking up the mug, accurate coordinates of its location relative to the body are required, together with the direction in which the handle is pointing, so that the hand can be pre-shaped accordingly. For this level of detail, foveal vision is required, and so a gaze shift is needed to bring the fovea to bear on the mug. The gaze-directing system then consults the egocentric model about the likely whereabouts of the mug, and directs the foveae of the eyes to it. (This may involve body and head movements as well as eye movements.) After a gaze movement based on coordinates from the model, and in many cases a further eye movement based on its seen retinotopic location, gaze is brought as close to the target as it needs to be. Having acquired the target, the visual system is now in a position to supply the motor system of the limbs with the information needed to formulate the required action. Further actions may ensue: filling a kettle, finding coffee or a tea-bag, pouring hot water into the mug, then milk and so on. Each action requires one or more foveal fixations to provide new information, but the gaze movement system also needs to refer back to the egocentric model to find the locations of new objects as they are required.
This dual scheme of representation, in which the egocentric map is updated on the basis of both sensory input and reading from the allocentric map, offers an efficient coding scheme in which our action plans can be executed within a space constructed from sensory and remembered information. Such a dual scheme allows for the potential of varying our reliance on the two types of information depending upon the relative reliability and availability of these types of information. Moreover, the relative reliance upon vision and memory can impact upon our moment-to-moment behaviour in two ways. First, and as described here, we may rely more upon remembered information from our allocentric map when constructing the egocentric map. Second, we may vary the reliance upon immediate retinocentric visual information and the egocentric map depending upon the reliability and availability of visual information . Within the visual field itself, the balance between information from vision and memory is likely to vary with eccentricity, with vision dominating in the region around the fovea and being increasingly supplemented by egocentric memory towards the periphery. The topography of this balance has yet to be explored.
6. Stability of the visual world revisited
Unlike the temporally and spatially disjointed series of images provided by the retinae, our phenomenal visual world is seamless and stable. Many possible explanations have been put forward for this stability, including the spatial updating of receptive fields by efference copy that occurs in parietal area LIP and other regions of the cortex [47,80]; the fact that much of the pictorial content of the image is discarded with each saccade  lessening the need to integrate successive fixations; or the possibility that we simply ignore the discrepancies because we know the world to be stable . Here, we introduce a further suggestion, namely that the current direction of gaze is anchored not only to the instantaneous visual scene, but also to the current egocentric model.
The attraction of this idea is that the egocentric model provides a continuous panoramic layout, so that moving the direction of regard around it need not involve the kinds of visual dislocation presented by the behaviour of the retinal image itself. Since the egocentric model rotates as gaze rotates (figure 10), objects within the model retain their spatial relationships with the external world as we look around it. Thus, if our conscious readout of the layout of the world derives from the egocentric model, this layout will stay still as gaze rotates, and this is indeed the way the world seems to us. We are not suggesting that the model is a substitute for the detailed pictorial information contained in the visual input itself, but rather that it acts as the reference frame to which gaze changes are indexed. It is true that the resolution of the egocentric model is low, but then so is our ability to detect displacements during a saccade . The indexing only has to be good enough to paper over the cracks.
This suggestion may also help to explain why, when looking around a familiar scene, we feel no discontinuity when making saccades into the regions beyond our current field of view. There are no surprises because we already have an outline of what is to be found there. Indeed, as the observations of Brouwer & Knill  make clear, the egocentric model overlaps and interacts with the visual input, and as figure 8 demonstrates, it can provide location information on its own when objects are out of sight. Thus, the egocentric model can provide a geometrical base within which the locations of objects can be stored temporarily, before being passed on to more permanent allocentric memory.
We thank the editor and two anonymous reviewers for their helpful suggestions on the previous version of this manuscript. This work is supported by a research grant awarded to B.W.T. by The Leverhulme Trust (F/00143/O) and a grant awarded to M.F.L. from the Gatsby Foundation.
One contribution of 11 to a Theme Issue ‘Visual stability’.
This Journal is © 2011 The Royal Society