Given that multiple senses are often stimulated at the same time, perceptual awareness is most likely to take place in multisensory situations. However, theories of awareness are based on studies and models established for a single sense (mostly vision). Here, we consider the methodological and theoretical challenges raised by taking a multisensory perspective on perceptual awareness. First, we consider how well tasks designed to study unisensory awareness perform when used in multisensory settings, stressing that studies using binocular rivalry, bistable figure perception, continuous flash suppression, the attentional blink, repetition blindness and backward masking can demonstrate multisensory influences on unisensory awareness, but fall short of tackling multisensory awareness directly. Studies interested in the latter phenomenon rely on a method of subjective contrast and can, at best, delineate conditions under which individuals report experiencing a multisensory object or two unisensory objects. As there is not a perfect match between these conditions and those in which multisensory integration and binding occur, the link between awareness and binding advocated for visual information processing needs to be revised for multisensory cases. These challenges point to the need to question the very idea of multisensory awareness.
If we think that phenomenal experience is of any relevance to psychology or cognitive science or plays any role in our mental life or behaviour, then we will have to have some grasp on it.
Marcel [1, p. 171].
Given the frequency of cases in which different senses are stimulated (especially if we include the vestibular and proprioceptive senses, which are almost constantly sending signals to the brain), there are reasons to consider that the paradigmatic case of perceptual awareness that we have to account for is likely to be multisensory. As stressed by several authors:
A fundamental aspect of our sensory experience is that the information from different modalities is often seamlessly integrated into a unified percept.
Angelaki et al. [2, p. 452].
In our daily lives we perceive each multisensory event as an integral of multiple unisensory signals by using not only vision but also audition and the sense of touch.
Fujisaki et al. [3, p. 301].
The richness of perceptual experience, as well as its usefulness for guiding behaviour, depends on the synthesis of information across multiple senses.
Fetsch et al. [4, p. 429].
Multisensory awareness, however, seems to have fallen into the blind spot of current research. To date, theories of perceptual awareness have typically considered the phenomenon as if it were a unisensory, mostly visual, construct [5–7]. Over the past few years, there has also been a growth of interest in the study of awareness in several of the other senses (such as in touch, see [8,9]; audition, see ; and even olfaction, see ) but without a major change: perceptual awareness remains studied on a modality-by-modality, or unisensory, basis.
This unimodal perspective is abandoned in multisensory research, but the focus, then, is not so much on awareness per se, but rather on ‘multisensory processing, that is, how the brain deals simultaneously with information from different sensory systems’ [12, p. 3]. The main argument against the unisensory perspective is, in fact, that situations of multisensory stimulation often lead to a behavioural or neural response that ‘is significantly different (e.g. larger, smaller) from the best component response’ obtained during unisensory stimulation (e.g. [13, p. 1717]; see also ). The integration of information from different sensory modalities is, then, often supposed to be matched at the level of awareness by a genuinely multisensory experience:
To perceive the external environment our brain uses multiple sources of sensory information derived from several different modalities, including vision, touch and audition. All these different sources of information have to be efficiently merged to form a coherent and robust percept.
Ernst & Bülthoff [15, p. 161].
In addition to processing information on a sense-by-sense basis, the brain is responsible for assembling the rich mélange of information from the various sensory modalities into a coherent and meaningful ‘view’ of the world.
Wallace et al. [16, p. 252].
It is important to stress that the kind of multisensory awareness mentioned here does not consist of unimodal experiences being jointly present in awareness, while keeping their individual characters. One can be aware of the sound of the bell while watching one's computer screen, for instance, but the auditory awareness of the sound and the visual awareness of the screen do not necessarily mix. They merely seem to coexist and, according to a philosophical distinction, they constitute cases of ‘phenomenal unity’. Cases of genuinely multisensory awareness are different, as they imply that the information coming from the different senses merges into a single unitary experience.
One kind of multisensory awareness typically includes cases in which the information concerning a single property of an object, coming from different sensory channels, is integrated into what is then often labelled an amodal representation of that property (though see  for a critical view of the notion of amodality). For instance, our awareness of the ball that we may be holding in our hands will be both visual and tactile; our awareness of our hands is itself presumably both visual and proprioceptive (figure 1a,b). Another kind of multisensory experience corresponds to cases where modality-specific features, such as coloured shapes and sounds, are experienced together as features of one and the same object or event (figure 1c,d). One is, then, aware of a multisensory object or event—for instance, a yellow bird singing. These two kinds of multisensory awareness differ in that the first concerns the integration of information into a single property, and the second the integration of parts into a single object (or whole). The processes underlying them are therefore sometimes distinguished as processes of either ‘multisensory integration’ or ‘crossmodal binding’. However, following the literature on visual binding, it is common to see them treated together, as if they were instances of the same general process—and the distinction between integration and binding starts to blur. ‘Binding’ and ‘integration’ are then used interchangeably to refer to the process by which information concerning distinct sensory features (such as tactile shape and visual shape, or sounds and colour) is bundled together as information concerning a common perceptible item. Although stated for vision, Treisman's point below is often applied to multisensory cases:
(i) the binding of properties—the integration of different object properties, such as the color red and the shape + to form a red cross; (ii) the binding of parts—the integration in a structured relationship of the separately codable parts of a single object, such as the nose, eyes, and mouth in a face … The first two seem to me to be closely related and to depend on the same binding mechanism.
Treisman [20, p. 98].
Research on multisensory cases does not only follow the model developed for visual research in assimilating property integration and object integration. Just as for vision, the operation of binding is also seen as a necessary condition for consciousness [20–27].
Can we simply take the current theories and protocols used to try and understand unisensory cases and then import them into the field of multisensory research? This is the approach that we wish to question here. As argued below, shifting to multisensory cases is not cost-free for the study of perceptual awareness. It introduces both methodological and theoretical pressures—what we broadly call here ‘constraints’. In §2, we focus on those methodological constraints presented by multisensory cases and review the limitations of the canonical protocols used to study unisensory awareness when they are applied in a multisensory setting. In our recent search through the literature, we have been unable to find a crucial experiment in which the results unambiguously demonstrate that a genuine case of multisensory awareness has occurred. While there is no doubt that subjective reports refer to multisensory objects, or that the responses obtained in multisensory cases differ from a simple addition of the responses to the unisensory stimuli, it is very difficult to find an objective measure confirming that awareness is genuinely multisensory. In §3, we review the reasons that might lead one to dissociate multisensory integration or binding from multisensory awareness. In §4, we turn to the theoretical constraints, showing why the shift to multisensory cases raises fundamental questions about the unified or dis-unified nature of consciousness, and the role of attention in determining our conscious experience of the objects and events in the world around us. Here, we introduce two alternative models that are capable of accounting for the experience of multisensory objects without claiming that awareness itself becomes multisensory. According to these two alternative models, awareness remains unimodal.
2. Methodological constraints
The past decade or so of studies on the neural basis of awareness, combined with decades of research on the topic of implicit or unconscious processing, have helped lead to the design of robust empirical protocols for distinguishing cases where awareness occurs from those where it does not. The most famous examples here might be Owen et al.'s studies of those most unfortunate of patients who find themselves stuck in a vegetative state, some of whom demonstrate responsiveness to verbal instructions—which, in turn, has been taken as evidence that the instructions were consciously perceived (but see [29,30] for discussion). Multisensory cases raise quite a different challenge, as what needs to be evidenced is not the occurrence of a conscious episode, but rather the occurrence of a certain mode of awareness (the ‘merged’ or multisensory percept mentioned in §1). The evidence collected, therefore, needs to be incompatible with other interpretations, such as the occurrence of merely unisensory modes of awareness. How successful are the various protocols used to study perceptual awareness in unisensory settings at addressing this major question?
As reviewed below, drawing in part on our own recent work on this question, experiments can successfully demonstrate a crossmodal modulation of participants’ unisensory awareness. The majority of experimental studies have involved the presentation of a stimulus that is ambiguous or difficult to identify in one sensory modality (e.g. vision), while, at the same time, a clear and unambiguous stimulus is presented in another modality (e.g. audition). Presumably, or so the logic goes, if the sensory inputs from the two modalities are integrated, the clear and unambiguous one should bias the interpretation of the less clear, or more ambiguous, one, thus giving rise to a clearer perception of the ambiguous stimulus. These important results, however, fall short of locating the facilitatory effect in multisensory awareness. At most, these protocols merely demonstrate that the information presented in one sensory modality can boost the occurrence of perceptual awareness of stimuli presented in another modality.
(a) Ambiguous displays
One sort of evidence that would support the multisensory awareness view would be if any perceptual alternations in the interpretation of an ambiguous display presented in one sensory modality were to be influenced by (or somehow linked to) the perceptual alternations taking place in another modality. However, to date, there is surprisingly little evidence in support of such a claim (see [29,30] for empirical evidence; and [31,32] for reviews). The only study that we are aware of that comes close to what we are after here was reported by Sato et al. These researchers investigated the auditory and visual verbal transformation effect. In the auditory version of this phenomenon, a participant listens to a speech stimulus that is presented repeatedly, for example the syllable ‘/ps/’. After a number of repetitions, the sound will be perceived to alternate, and the observer will likely hear it as ‘/sp/’ instead. As time passes, the percept then alternates back and forth between the two possible interpretations.
Sato et al. discovered that the same thing happens if an observer looks at moving lips that happen to be repeatedly uttering the same syllable (this is known as the visual transformation effect). In their study, Sato and his colleagues presented auditory-alone, visual-alone and audio-visual stimulus combinations that were either congruent or incongruent (i.e. an auditory ‘/ps/’ paired with a visual ‘/sp/’). The participants were instructed to report their initial auditory ‘percept’, and then to report whenever it changed over the course of the 90 s of each trial. The incongruent audio-visual condition, in which the visual stimulus alternated between being congruent and incongruent with what was heard, resulted in a higher rate of perceptual alternations than any of the other three conditions (though see  for the suggestion that speech may constitute a unique case when it comes to multisensory multistability).
(b) Binocular rivalry
The phenomenon of binocular rivalry can be seen as providing a fascinating window into visual awareness in humans. Binocular rivalry occurs when two dissimilar figures are presented to corresponding regions of the two eyes. Observers perceive one of the figures as dominant (while often being unaware of the other). After a while, the dominance of the figures may well reverse and then keep alternating over time. This perceptual alternation has been attributed to the fact that the visual system receives ambiguous information and hence tries to find a unique perceptual solution, with the two images competing for control of the current conscious percept (see  for a review). Several studies have tried to understand how visual awareness emerges in the binocular rivalry situation. According to Helmholtz's early view, the alternation of perceptual dominance is under voluntary attentional control. Later researchers, however, have argued that the phenomenon results from competition between two monocular channels, or else between the two pattern representations, one presented to each eye. More recent models have suggested that the mechanism underlying binocular rivalry includes not only competition at multiple levels of information processing (for a review, see ), but also some form of excitatory connections that facilitate the perceptual grouping of visual stimuli, as well as top-down feedback, for example attentional control [44–46]. In short, the underlying mechanisms giving rise to conscious perception in the binocular rivalry situation likely involve a variety of neural structures throughout the visual processing hierarchy.
van Ee et al. conducted the first study to demonstrate that the information from another sensory modality, in this case either audition or touch, can modulate people's visual awareness in the binocular rivalry situation. In their study, the participants were presented with a looming pattern to one eye and a rotating pattern to the other eye. The dominance duration of the looming (or rotating) pattern was shown to be enhanced when the rate at which the pattern happened to move was synchronized with a series of pure tones or vibrotactile stimuli (or their combination). Nevertheless, it is important to note that this crossmodal modulatory effect was only observed when the participants endogenously attended to the visual pattern that was consistent with the rate of stimulus presentation in the other modality. Therefore, on the basis of this crossmodal modulation of participants' awareness, one cannot exclude the possibility that the information presented to the other sensory modality was modulating visual awareness via an attentional process.
To date, a growing number of studies have documented multisensory modulations of binocular rivalry in terms of various crossmodal correspondences. For example, an auditory modulation of binocular rivalry has been shown to result from a coherent direction of motion (e.g. motion on the left/right axis ; or motion on the looming/receding axis ). Crossmodal modulations have also been observed in terms of the semantic coherence of natural/artificial objects (such as the sound of a bird chirping or a car engine revving, see ). Meanwhile, a tactile modulation of binocular rivalry occurs when the gratings in the visual and tactile modalities have the same orientation [51,52]. Finally, the odour of a rose/marker pen has been shown to enhance the dominance duration of the corresponding visual percept [53,54], or vice versa (i.e. a visual modulation of binasal rivalry, see ).
In all of the cases reported above, the multisensory inputs can, to a certain extent at least, be argued to be coherent. The possibility therefore remains that the information in the other sensory modality provides a cue that simply directs a participant's attention toward the congruent percept.1 Chen et al. therefore designed an experiment in order to try and tease apart the auditory semantic and the attentional modulation of binocular rivalry (figure 2): they manipulated the target of participants’ selective attention over the dichoptic figure independently of the meaning of the sound. That is, their participants were instructed to maintain the bird percept, to maintain the car percept or else to view the figures passively in the control condition [45,46]; meanwhile, the participants heard either the sound of birds chirping or an engine revving. Their results demonstrated that both the auditory semantic context and the participants’ voluntary attention influenced the participants’ dominant percept, and that these two factors appeared to work in an additive fashion. A further analysis of the dominance durations of the bird/car percept (not reported in the published paper) revealed that the presentation of the auditory semantic context was more likely to shorten the duration of the semantically incongruent percept than to prolong the duration of the semantically congruent one. Note that this effect is different from the modulatory effect of attention on binocular rivalry, which resulted from a prolonging of the dominance duration of the attended percept. In summary, Chen et al.'s results suggest that the modulations of binocular rivalry by crossmodal congruency and by attention can, to a certain extent, be dissociated, and hence that the former should not always be thought of as relying on the latter.
(c) Bistable figure perception
Bistable figure perception constitutes another phenomenon whereby the presentation of an ambiguous static figure can elicit a dynamically alternating percept: when an observer looks at an ambiguous visual figure that affords two possible interpretations (such as the Necker cube or the Rubin face/vase figure), they may perceive either one of the interpretations initially, and then become aware of the other interpretation a short while thereafter. Subsequently, as time goes by, the observer's perception will likely alternate spontaneously between the two possible interpretations. The critical difference between binocular rivalry and bistable figure perception lies in the fact that, in the latter case, observers view the same ambiguous figure with both eyes; that is, the visual stimulus cannot be segregated into two visual patterns early in visual information processing. This difference suggests that the level at which the visual competition occurs should be higher in the case of bistable figure perception than in the case of binocular rivalry. As a result, it has been suggested that the perception of bistable figures may be more susceptible to modulation by attention than binocular rivalry [45,46].
Crossmodal modulations of bistable figure perception have been reported using, for example, the face/vase figure modulated by speech stimuli, and the ‘my wife or my mother-in-law’ bistable figure modulated by female voices of recognizably different ages. In Hsiao et al.'s work, the modulatory effects of auditory semantic congruency and attention were further compared. In particular, the sound that the participants heard (the voice of either the old woman or the young lady) and the participants’ attention to the bistable figure (when instructed either to try and maintain the view of the old woman, to maintain the view of the young lady, or to view the stimulus passively) were manipulated independently. Interestingly, the results demonstrated that even though the effect of auditory semantic congruency was significant in all three attention conditions, its magnitude was significantly less pronounced when the participants were voluntarily attending to either the old-woman or the young-lady view (as compared with the passive viewing condition). Thus, the modulations attributable to auditory semantic congruency and attention in the bistable figure perception case are difficult to dissociate completely. In addition, combining the results of research on the auditory modulation of binocular rivalry and of bistable figure perception suggests that crossmodal modulations likely represent a heterogeneous set of empirical phenomena in various situations (or, put another way, that they are determined by different mechanisms; see also ).
(d) Continuous flash suppression
Continuous flash suppression (CFS) is the name given to a recently developed method that uses the mechanism of interocular suppression in order to maintain a visual stimulus in a subliminal state for a few seconds [62,63]. In the typical experimental setting, an observer is presented with a low-contrast stimulus to one eye and a dynamically changing high-contrast colourful Mondrian-type pattern to the other eye in the overlapping visual field of each eye. This presentation protocol often results in the low-contrast visual stimulus being suppressed and hence being unavailable for conscious report by the observer. If the contrast of the suppressed stimulus is gradually increased, however, it eventually breaks through to consciousness and hence can then be detected (and reported on) by the observer. In this paradigm, the response time (RT) for participants to detect the suppressed stimulus is usually used as the behavioural index of awareness.
Alsius & Munhall used CFS to demonstrate that RTs were shorter when a suppressed talking face was paired with a matching auditory speech stream than with an incongruent one (though note that Palmer & Ramsey have demonstrated that the unaware, or suppressed, face does not modulate auditory speech identification). Similarly, Yang & Yeh have reported that RTs to detect a static face are modulated by various face–voice correspondences, including timbral features (e.g. mom's or baby's laughter) and emotional prosody (e.g. happy or angry). CFS can thus be used by researchers to provide robust evidence that audio-visual integration can occur at an unconscious level and that it can, in turn, boost the subliminal visual stimulus to the conscious level.
(e) The attentional blink
When two different letters serving as targets are embedded within a string of digits and all of these stimuli are presented one by one in a rapid serial visual presentation (RSVP) stream, participants frequently miss the second target (T2) after having detected or identified the first (T1), especially when T2 is presented close in time to T1 (typically between 100 and 500 ms later). This phenomenon, known as the attentional blink (AB), was first reported by Raymond et al. Note that even though T2 may not be consciously detected or reported by the observer, it can nevertheless still elicit a priming effect on the stimulus presented afterwards. That is, T2 is processed even while suppressed below the level of conscious awareness.
Olivers & van der Burg demonstrated that when a tone is presented at the same time as T2, participants are more likely to report it correctly. However, if the tone is presented with the item that precedes T2 instead, no such crossmodal facilitation is observed; that is, the crossmodal facilitatory effect is not attributable to the tone merely alerting the participants to the imminent arrival of T2 (see also  for a review of the literature in this area). Interestingly, the magnitude of the AB can be increased (i.e. the accuracy of reporting T2, given that T1 was correctly reported, decreases) when T1 is a letter paired with a different, rather than the same, letter name presented auditorily. Thus, the crossmodal modulation of the AB is closely linked to participants’ devoting attentional resources to the processing of T1.
(f) Repetition blindness
Repetition blindness (RB) refers to the phenomenon whereby, when an item is presented twice within an RSVP stream, observers may miss its second occurrence (note that RB occurs within rapid serial auditory presentation streams as well; see ). It has been suggested that RB reflects a failure by the observer to individuate the two occurrences of the same item; that is, the type of the item is recognized, but observers remain unaware of its second occurrence.
Chen & Yeh presented two tones simultaneously with the onsets of the two repeated items. Participants were more likely to correctly report the repeated item twice (i.e. the RB effect was reduced) than when the tones were not presented. In a follow-up study, Chen & Yeh manipulated the two tones so that they had either the same or a different pitch. Once again, the presentation of the tones effectively reduced the magnitude of the RB effect, and it did so to a similar extent in both conditions, suggesting that temporal synchrony is the critical factor in the audio-visual facilitation observed in the RB paradigm.
(g) Backward masking
In the backward masking paradigm, two stimuli are presented, one after the other, at more or less the same spatial location, with the first serving as the target and the second as the mask. Backward masking has frequently been used in previous crossmodal studies in order to control the visibility of a visual target by manipulating the stimulus onset asynchrony between target and mask. Within a certain range of task difficulties, the presentation of a simultaneous sound (or one asynchronous within a range of about ±200 ms) has been shown to enhance the accuracy with which participants detect or identify the visual target. Such crossmodal facilitation effects have been reported under various conditions of backward masking: for example, when the perception of a target light-emitting diode (LED) is degraded by the subsequent presentation of four surrounding LEDs (e.g. [77,78]; see also ), or when a target letter is followed by a masking letter.
(h) Interim summary
Thus far, we have reviewed the latest research to have emerged from studies that have used a number of different experimental paradigms to demonstrate that visual stimuli are processed despite being suppressed from conscious awareness. By presenting a stimulus in another sensory modality (typically audition) that is coherent, or congruent, with the visual stimulus in a certain regard (such as in terms of its temporal synchrony or semantic content), multisensory integration may occur, or the other sensory input may guide attention to the coherent (or congruent) visual stimulus.
3. The limits of the binding-awareness link
It is unquestionably the case that individuals report experiencing objects or features across two or more sensory modalities. They talk about singing birds, warm blue cups or whistling kettles. The assumption, though, is that these reports correspond to an awareness that is genuinely multisensory in nature and that this maps onto multisensory processes of binding or integration. We want to point out, however, that the existence of this mapping is put under pressure by various lines of evidence.
(a) Evidence of multisensory interactions does not constitute evidence of multisensory awareness
It is fair to say that most of the evidence that has been collected about multisensory interactions has not focused on the relation between these processes and consciousness. At the level of human behaviour, the speed and accuracy of a participant's performance in a multisensory task setting has often been shown to be significantly different from their baseline level of performance obtained under conditions of unisensory stimulation. Performance with auditory and visual targets presented together will, for instance, be better than performance with the auditory or visual target presented alone. Whether participants perform better because they are aware of a single audio-visual target, or because of the influence of one type of sensory information on the processing of another, is not, however, demonstrated by such results (see  for discussion).
At the neurological level, the occurrence of multisensory integration is evidenced by the fact that the responses of single neurons (or else populations of neurons, or even specific voxels or groups of voxels in neuroimaging studies) in a specific multisensory setting are significantly different from the sum of the responses observed in the corresponding unisensory settings [13,82]. An alternative index of multisensory integration at the whole-brain level may be revealed by studying the synchronized oscillation of neural activity: when visual and auditory stimuli are presented close together in space and/or time, or are semantically congruent, phase-coherent neural activity in the gamma-band frequency range can often be observed [83,84]. This, or other forms of activation, though, fails to say anything about the nature of an observer's conscious experience; whether participants (or the anaesthetized animals on which the majority of the early multisensory integration research was conducted) experience a unified object or event is simply not evidenced by such measures. This said, neurological studies might provide evidence of whether multisensory integration occurred in the brain and whether it is the focus of attention; see, for instance, the study by Fairhall & Macaluso, which reported that multisensory integration (as evidenced by subcortical activation in the superior colliculus) grabs attentional resources (as shown by activations in higher-order fronto-parietal areas) even when entirely task-irrelevant.
(b) Subjective reports do not establish the occurrence of multisensory awareness
The method that gets closest to investigating the occurrence of multisensory awareness consists in looking for contrast cases at the level of subjective report. The principle is to establish when subjective reports are consistent with the experience of a single object integrating the features processed in two (or more) modalities, and to contrast them with reports of two separate features or events in two modalities. This technique has been widely used in the field of multisensory research. For instance, individuals have frequently been asked which syllable was uttered when presented with incongruent speech sounds and clips of lip movements (for instance, speech sounds corresponding to the syllable ‘ba’ and lips articulating a ‘ga’, see ). In some conditions, participants report experiencing a syllable (‘da’; this is known as the McGurk effect) that is inconsistent with what they should have experienced had they relied only on audition, suggesting that their experience has been influenced by the information presented visually. In another famous case, Shams et al. had their participants say whether they were aware of one or two flashes when jointly presented with a single visual flash and two brief sounds.
Here, though, the perceptual reports can indicate the occurrence of audio-visual awareness, but they are also compatible with the experience itself being in one sensory modality (auditory for the speech stimulus in the McGurk effect, visual for the individuation of flashes in the two-flash illusion): the experience in one modality might well have been influenced by the second source of information presented in another modality, which does not, however, appear as such to form a unified multisensory conscious episode. The fact that participants only report experiencing a multisensory object when the component stimuli are presented within a certain temporal and/or spatial window could mean that spatio-temporal proximity is necessary for participants to judge that the two pieces of information refer to one and the same object (e.g. ; though see also ).
We are actually sceptical that any contrast case will help to decide between cases where information coming from two sensory modalities is genuinely experienced together and cases where it is experienced one piece after the other and merely judged to have come from the same source and to belong to the same object. That said, let us be clear that this does not mean that we do not see a conceptual difference between the two .
In order to make some progress here, the study of multisensory consciousness should not be limited to the study of perceptual qualitative reports about objects and events. Take all those incidental reports concerning accompanying feelings—such as feelings of unity or disunity, and of incongruence and congruence—that have been noted in studies resting on the presentation of stimuli in different sensory modalities. Are they not conscious manifestations as well? Anyone who has experienced the McGurk effect can see for themselves that, when the auditory and visual stimuli start to fall too far apart in time, or when a male voice is paired with a synchronized female face, one may feel slightly ill at ease (see also [92,93]). Jackson  presented his participants with the sound of a whistling kettle coming from one of seven different locations, and asked them to determine from which of the seven visible kettles surrounding them (placed behind a Perspex screen) the sound was coming. As Jackson himself admitted, the participants reported a mild feeling or impression of ‘intermodal conflict’ when they received an auditory stimulus and a simultaneous visual cue from separate locations, although they supposed and perceived that both came from one and the same object. Many of the participants maintained that they ‘must be imagining’, or that they had experienced a ‘detached feeling’, or even nausea. This might signal a gradual breach in the experience of a single object, or another kind of conscious manifestation, even in what presumably count as successful cases of multisensory integration.
Metacognitive feelings—of which one is also aware—are then an obvious category to investigate in relation to the awareness that manifests itself in multisensory situations, but they seem to have disappeared from experimental reports since the 1970s (see [90,95] for a couple of engaging early examples). Researchers no longer seem to want to ask their participants what they felt about an experiment, set-up or stimulus display, perhaps because of the difficulties associated with the treatment of their participants’ (admittedly sometimes strange) replies. However, asking participants how confident they felt about their perception of a single object, or else starting to pay attention to individual differences in subjective reports, is a desirable change that should be imported from unisensory to multisensory studies of consciousness.
(c) Dissociating multisensory integration and multisensory awareness
Confirming our initial diagnosis, the investigation of multisensory awareness is not central to many studies documenting multisensory interactions, and establishing whether and when awareness is multisensory rather than unisensory is problematic. Several recent studies have pushed things further and advocate a dissociation between the occurrence of multisensory integration and that of multisensory awareness.
The most recent demonstration comes from a study, using a paradigm developed for visual binding, showing that reports of experiencing a single audio-visual object were not perfectly aligned with independent evidence of multisensory integration . The authors concluded that ‘feature binding was entirely unaffected by conscious experience: features were bound whether they were experienced as occurring together or as belonging to separate events, suggesting that the conscious experience of unity is not a prerequisite for, or a direct consequence of binding’ (, p. 586). Meanwhile, research on the Colavita visual dominance effect provides evidence of multisensory integration in the absence of awareness of one of the component stimuli (normally it is the visual stimulus that extinguishes awareness, or at least report, of some proportion of the auditory stimuli on bimodal target trials ).
More pressure also comes from the claim that even certain reasonably well-established examples of multisensory interaction (and assumed cases of multisensory awareness) may not actually necessitate there being any kind of integration in the first place. Over the years, many studies have, for instance, taken violations of the Miller inequality under conditions of simultaneous audio-visual stimulation as evidence for multisensory integration. The redundant target effect shows that responses are faster when visual and auditory target stimuli requiring a speeded detection response are presented together  than when they are presented separately. However, as argued recently by Otto & Mamassian , the results of such studies do not necessarily show that sensory evidence is integrated multisensorially prior to the participant making a perceptual decision; the evidence may rather have been accumulated separately for each signal (see also ). In addition, Chen & Spence's  study demonstrating that simultaneously presented semantically congruent pictures and sounds can modulate a participant's response criterion without necessarily impacting on their perceptual sensitivity (as indexed by d′) in a picture-detection task provides another good example here. One thing to investigate is whether individuals would then be happy to report having been aware of multisensory objects or events. If not, subjective reports of an integrated object would not necessarily correspond to a process of multisensory integration.
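The Miller inequality mentioned above states that, if the two signals merely race rather than being integrated, the cumulative distribution of reaction times to redundant (audio-visual) targets can never exceed the sum of the two unisensory cumulative distributions: P(RT ≤ t | AV) ≤ P(RT ≤ t | A) + P(RT ≤ t | V). A minimal sketch of this test, with entirely hypothetical reaction times and function names of our own choosing, for illustration only:

```python
def ecdf(rts, t):
    """Empirical cumulative distribution: the proportion of reaction
    times (in ms) at or below time t."""
    return sum(rt <= t for rt in rts) / len(rts)

def race_model_violated(rt_av, rt_a, rt_v, t):
    """Miller's race-model inequality: under independent racing
    (no integration), the redundant-target CDF cannot exceed the sum
    of the unisensory CDFs at any time t. A violation is taken as
    evidence that the two signals were not processed independently."""
    return ecdf(rt_av, t) > min(1.0, ecdf(rt_a, t) + ecdf(rt_v, t))

# Hypothetical reaction times (ms), chosen so that redundant-target
# responses are faster than either unisensory condition:
rt_a  = [320, 340, 360, 380, 400]   # auditory-only trials
rt_v  = [330, 350, 370, 390, 410]   # visual-only trials
rt_av = [250, 260, 270, 300, 330]   # redundant audio-visual trials

print(race_model_violated(rt_av, rt_a, rt_v, t=300))
# → True: at t = 300 ms the redundant-target CDF (0.8) exceeds the
#   summed unisensory CDFs (0.0), violating the race-model bound
```

As Otto & Mamassian's argument implies, even a violation of this bound constrains only the statistics of the decision process; it licenses no direct conclusion about whether the participant was aware of a unified audio-visual object.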
In summary, contrary to the visual paradigm, binding and awareness come apart in multisensory cases. They even present a form of double dissociation: awareness is not necessary for all cases of multisensory integration, and multisensory integration is not necessary for the attribution of features to a single object across sensory modalities.
4. Theoretical challenges: choosing among three competing models
Does the fact that several sensory modalities cooperate to build our awareness of the objects and events in the environment around us necessarily mean that awareness (or what philosophers might prefer to call ‘phenomenal consciousness’2) is itself multisensory? How can we be sure that a situation of multisensory stimulation leads to the awareness of a unified multisensory object (figure 3a) rather than merely to a sequence, or juxtaposition, of dis-unified unisensory states of awareness (figure 3b,c)?
Whereas the aforementioned evidence is indeed compatible with a unified multisensory awareness (for instance, an audio-visual conscious episode), it is also compatible with at least two other distinct possibilities on which each conscious experience remains unisensory. The two alternative models resort to a fast alternation between episodes of unisensory awareness, such that, at any one moment in time, one is only aware of one object or feature in one sensory modality. What explains the behavioural results, and the report of experiencing a single object with features or aspects belonging to different modalities, is the fast and unnoticed attentional shifting between the distinct unisensory experiences. It might be, then, that the unattended experience is simply not conscious, but is brought into consciousness when attention shifts to the corresponding modality (figure 3b). Awareness, being the outcome of where (and on what) attention is focused, switches between the two features depending on an observer's voluntary attentional set or on the pattern of external (or exogenous) stimulation that happens to be impinging on the sensory end organs at any given time . Alternatively, however, it might be that the unattended experience is conscious, but simply not salient (figure 3c). In both cases, the main claim is that the brain's dealing with simultaneous information from different sensory modalities might not always, or necessarily, lead to the relevant information being integrated into unified multisensory object representations that are then made available to awareness (see also [81,107] for discussion).
One reason to resist the view that awareness is just a ‘seamless percept’ is the observation that cases of multisensory stimulation have never been shown to confuse observers with regard to the sensory modalities in which they have had an experience: with the exception of the confusing case of flavour, which might itself anyway have to count as a single sense [108,109], individuals can still tell that they saw a flash, heard a beep and felt a stroke of their hand. In the rubber-hand illusion, the result of seeing the rubber hand being stroked while being synchronously stroked is felt as a change in one's proprioceptive awareness . If the percept is so unified, and if modes of awareness are fused, it seems difficult to address the question of how one recovers the modal signature of each component feature (unless this is done outside of consciousness, see ). Conversely, the thought that awareness keeps a form of inner distinction, or is rather a juxtaposition of modal awareness that could be part visual and part auditory, is difficult to reconcile with the view that awareness is just multisensory through and through. What this shows is that theories of multisensory awareness need to go beyond the generic or merely metaphorical talk of ‘seamless percepts’ or ‘unity’ and explain the mechanisms at stake in the sharing.
One thing we feel that the multisensory awareness approach explains more elegantly than the other approaches is the attribution of features belonging to distinct sensory modalities to one and the same object. As we said earlier, this attribution is certainly part of everyday phenomenological reports, and it seems simpler to think that awareness just mirrors these reports. However, the possibility cannot be excluded that reports of experiencing multisensory objects or features come from a rapid oscillation between unisensory experiences, with the unity of the object being simply part of the cognitive treatment of this shifting information. Plausible as this solution is, though, it certainly needs to be developed further: how does the switch from, say, the awareness of a sound to the awareness of a shape tell us that this sound is of this shape, or that the sound and the shape refer to one and the same object? Is spatial co-location preserved in the switch, and is it sufficient to give what one expects from the attribution to a single object or event? Various replies could be developed here: the unity of objects across sensory modalities could be a cognitive or conceptual construct, and not directly experienced. There is no reason why unisensory and multisensory objects should necessarily be explained in the same way, and binding could, for example, explain the first and not the second. Recently, several authors (see  for an overview) have also started to look back to early Gestalt principles of connectedness and proximity as perhaps more relevant ways to think about the relations between the unisensory experiences than those offered by the multisensory awareness model (see also [32,113]).
In this article, we have attempted to stress a gap between the observation that paradigmatic cases of awareness occur in multisensory settings and the fact that awareness is still studied within a unisensory framework. The recycling of unisensory protocols is unlikely to provide good ways to study multisensory awareness, if there is indeed such a thing. One additional element not to be missed is the possible role of attention in multisensory cases (see [114,115] for reviews, and §2b,c). While there is much to expect from the growing body of research in this area, we would like to suggest that other possible manifestations of multisensory interactions—such as metacognitive feelings of congruence—should also become a central focus.3
Our main conclusion is that multisensory cases of perceptual awareness do not just challenge empirical methods of investigating awareness, but raise fundamental questions regarding our theories of awareness—including the prospect of unveiling its neural correlates. The natural assumption when it comes to cases of multisensory stimulation is that the different modes of awareness generated in unisensory cases can all homogeneously be integrated into the multisensory awareness of a multisensory object or event. So, for instance, the awareness of two yellow balls noisily bouncing against one another is treated as an audio-visual percept that integrates both the visual awareness of the yellow balls and the auditory awareness of the bouncing sounds (e.g. ; though see ). The view here clearly seems to be that awareness is of the same kind regardless of the particular sense that happens to be under consideration: that is, there would appear to be nothing different, as far as our theories of conscious awareness are concerned, between being aware of a sound and being aware of a patch of colour, or say a taste. These are all just modes of perceptual awareness, not different kinds of consciousness—the reason, then, why people never talk about a plurality of consciousnesses. The quest for the neural basis of awareness, in this sense, has been made a general project that spans the various sensory modalities, and one does not expect different replies to arise for each of the senses. However simple and elegant, this homogeneity assumption remains to be tested, along with the principle of unified multisensory awareness.
O. Deroy and C. Spence are funded by the RTS-AHRC grant, within the ‘Science in Culture’ scheme.
One contribution of 13 to a Theme Issue ‘Understanding perceptual awareness and its neural basis’.
↵1 One could legitimately wonder whether any mental imagery that is elicited by the information presented in another sensory modality  can serve as a top-down cue that may modulate perception in the binocular rivalry situation .
↵2 Phenomenal consciousness refers to the fact that the world seems or appears a certain way from an observer's point of view; it is sometimes distinguished from access consciousness (see  for a recent statement). The notion is close to Nagel's  idea that there is something ‘it is like’ to perceive the world from a subjective point of view.
↵3 Finally, what about other conscious activities, such as mental imagery or memory? Arguably, if an experience is multisensory, then the memory or re-enactment of that experience will also be multisensory—but does it really work so simply? On the one hand, it is currently unclear whether we have imagery for complex sensory experiences, for example, flavours. On the other hand, the rapid switching of awareness between unisensory episodes of imagery or memory also needs to be compared to explanations in terms of multimodal episodes .
- © 2014 The Author(s) Published by the Royal Society. All rights reserved.