The verbal transformation effect (VTE) refers to perceptual switches while listening to a speech sound repeated rapidly and continuously. It is a specific case of perceptual multistability providing a rich paradigm for studying the processes underlying the perceptual organization of speech. While the VTE has been mainly considered as a purely auditory effect, this paper presents a review of recent behavioural and neuroimaging studies investigating the role of perceptuo-motor interactions in the effect. Behavioural data show that articulatory constraints and visual information from the speaker's articulatory gestures can influence verbal transformations. In line with these data, functional magnetic resonance imaging and intracranial electroencephalography studies demonstrate that articulatory-based representations play a key role in the emergence and the stabilization of speech percepts during a verbal transformation task. Overall, these results suggest that perceptuo (multisensory)-motor processes are involved in the perceptual organization of speech and the formation of speech perceptual objects.
The verbal transformation effect (VTE) is experienced during rapid and continuous repetition of a speech sound. For example, while listening to the repetition of the word ripe, listeners initially perceive the veridical percept (i.e. ‘ripe’), but after a few seconds, they may report hearing utterances such as ‘right’, ‘rife’ and ‘bright’ . This transformation process persists throughout the repetition procedure, leading to perceptual switches from one percept to another. The VTE was first reported by Warren & Gregory  and was described as an auditory illusion which may reveal the processes responsible for enhancement of the accuracy of perception in the noisy environments encountered in everyday life . It is a speech-specific case of perceptual multistability, which is well known in vision (for a review, see ) and in audition (e.g. ), and provides a rich paradigm for studying the perceptual organization of sensory scenes.
Perceptual organization refers to the structuring of a sensory scene into meaningful perceptual objects. Drawing an analogy to the visual domain, Bregman  describes two types of mechanisms, respectively based on primitives and schemas, guiding the perceptual organization of auditory scenes (for a review of auditory multistability, see ). Primitives are basic, probably largely innate, auditory processes driven by the incoming acoustic data while schema-based mechanisms involve the activation of learned familiar perceptual patterns. Bregman  suggests that the same types of mechanisms, i.e. auditory primitives and phonetic schemas, are used for speech scene analysis. A few years later, Remez et al.  argued that general auditory scene analysis mechanisms fail to explain the perceptual coherence of speech. The authors suggested instead that ‘the perceptual organization of speech depends on sensitivity to time-varying acoustic patterns specific to phonologically governed vocal sources of sound (, p. 130)’. In addition, they proposed that ‘ … visible structures of articulation and the acoustic signal of speech combine perceptually as if organized in common to promote multimodal perceptual analysis (, p. 152)’.
The VTE provides an efficient tool for studying speech scene analysis. It enables researchers to address questions such as: (i) is it possible to decompose the speech stream into ‘natural’ units of information that play a pivotal role in the transformation process? (ii) If there are indeed elementary units of information, are they purely auditory or are they multimodal (audiovisual at least)? More generally, can auditory and visual streams be bound together in the multistability process? (iii) What is the role of motor processes in the processing of speech representations and in the switches from one state to another? (iv) Ultimately, what is the nature of the speech perceptual object (auditory, motor or perceptuo-motor)? In this paper, we review a number of recent results on the VTE, related to these questions about the perceptual organization of speech.
2. Auditory bases of the verbal transformation effect
(a) Primitive- and schema-based verbal transformations
Different types of verbal transformations have been reported in the literature (e.g. [9–13]). They can be tentatively separated into primitive- and schema-based transformations based on Bregman's classification scheme for auditory scene analysis.
The primitive-based VTE should involve only general principles, not based on learned speech forms. This is typically the case for transformations based on auditory streaming. For example, the repetition of /skin/ may lead to the segregation of the /s/ sound from the main stream, so that /kin/ or /gin/ is perceived in the foreground and /s/ in the background . These transformations can be considered as phonetic equivalents of auditory switches between fused and segregated percepts as described earlier  (see also ). Another very common kind of transformation can be described as the online re-segmentation of the speech stream. For example, listening to the rapid repetition of the word life may lead to the perception of the word fly (by cutting the acoustic stream before ‘f’ rather than after), which may transform back, after a few seconds, to life and so on. This segmentation process could be conceived as a ‘speech primitive’ grouping mechanism in which perception alternates between two possible organizations of the phonetic stream.
The schema-based VTE involves learned patterns such as in phonetic, lexical or semantic modifications of the repeated sequence. Phonetic modifications consist of the substitution of the repeated phoneme by a phonetically similar one (e.g. /ɛ/ → /æ/ ) and/or phoneme insertions (e.g. // → /pold/ ). Lexical modifications can be small phonetic deviations of the repeated sequence resulting in words (e.g. ‘ripe’ → ‘bright’ ). Finally, verbal transformations may involve semantic modifications with large phonetic deviations from the repeated sequence (e.g. ‘trice’ → ‘florist’ ). Note that schema-based mechanisms may also contribute to what we have called primitive-based VTE (at least in verbal transformations that involve re-segmentation, such as ‘fly’ → ‘life’) in the sense that possible alternating forms can be meaningful words. However, this does not exclude the contribution of ‘speech primitives’ in the primitive-based VTE.
(b) Explanations and models for the verbal transformation effect
Adaptation of speech representations and perceptual regrouping of acoustic elements have been suggested to be the cause of the VTE . The most elaborate framework for understanding the VTE is the node structure theory (NST) . The NST is a localist model (i.e. a single node represents one processing or memory unit) with three levels of representation: muscle movement, phonological and sentential. The nodes at the muscle movement level are dedicated to speech production and represent patterns of movement for the speech muscles. The phonological level contains sublexical nodes, such as syllables and phonetic features. The nodes at the sentential level represent words and phrases. During speech perception, the input primes nodes at the phonological level. Priming increases the possibility of a node being activated but it cannot by itself activate a node. The strength of priming is related to how well a node matches the input of the model. The priming spreads from the phonological level to the lexical nodes (at the sentential level). Several nodes may be primed simultaneously. The activating mechanism in the NST ensures that only one node is activated at once: only the most-primed node that reaches the threshold becomes activated. According to the NST, the repeated activation of a node owing to the repetition of an utterance causes adaptation of the node. This results in a fall in the node's priming eventually to below the priming of a competitor node. The competitor node therefore becomes dominant, and a verbal transformation occurs .
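The adaptation account described above can be illustrated with a toy simulation. The following is only a minimal sketch and not the NST itself: it keeps two competing lexical nodes, and all numerical values (match scores, threshold, adaptation and recovery rates), as well as the assumption that non-activated nodes recover from adaptation, are introduced here purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical lexical node; `match` stands for how well it fits the input."""
    label: str
    match: float
    adaptation: float = 0.0

    @property
    def priming(self) -> float:
        # Priming reflects the match with the input, reduced by adaptation.
        return self.match - self.adaptation


def simulate(nodes, n_repetitions, threshold=0.3, adapt_rate=0.05, recover_rate=0.02):
    """Return the label activated at each repetition of the utterance."""
    percepts = []
    for _ in range(n_repetitions):
        # Only the most-primed node that reaches the threshold is activated.
        candidates = [n for n in nodes if n.priming >= threshold]
        winner = max(candidates, key=lambda n: n.priming) if candidates else None
        percepts.append(winner.label if winner else None)
        for n in nodes:
            if n is winner:
                n.adaptation += adapt_rate  # repeated activation adapts the node
            else:
                # Assumed recovery of non-activated nodes (not stated in the text).
                n.adaptation = max(0.0, n.adaptation - recover_rate)
    return percepts


nodes = [Node("ripe", match=1.0), Node("right", match=0.77)]
percepts = simulate(nodes, n_repetitions=40)
print(percepts)
```

In this sketch the veridical form is reported first; once its priming has fallen below that of the competitor, the competitor is activated, which corresponds to a verbal transformation.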
Another class of models is based on the properties of dynamic systems (see also ). Ditzinger et al. developed a computational model for the VTE based on a model of perceptual switches of visual reversible figures. In this model, percepts are represented as system attractors, i.e. local minima of the system energy. A verbal transformation occurs when the energy associated with a percept increases, so that it no longer represents a local minimum. This leads the system to switch to a more stable state. The energy increase in this model is owing to the saturation of attention to the current percept. Notice that, while this model provides a possible explanation for multistability based on multiple equilibrium states within limit cycles of a dynamic process, it does not clarify the nature of speech representations and the underlying analysis processes.
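The attractor idea can be made concrete with a deliberately simplified system. The sketch below is not the model of Ditzinger et al.; it assumes a one-dimensional state `x` (positive for one percept, negative for the other), a double-well energy, and a hypothetical saturation variable `a` that slowly tilts the energy against whichever percept is currently held. The parameters `dt`, `eps` and the form of the energy are all assumptions.

```python
def energy(x, a):
    """Assumed double-well energy, tilted by the saturation variable a."""
    return x**4 / 4 - x**2 / 2 + a * x


def simulate(steps=10000, dt=0.01, eps=0.05, x0=1.0):
    """Euler integration of gradient descent on the energy with slow saturation."""
    x, a = x0, 0.0
    xs = []
    for _ in range(steps):
        dx = -(x**3 - x + a)  # fast dynamics: descend the energy gradient
        da = eps * x          # slow dynamics: saturation follows the current percept
        x += dt * dx
        a += dt * da
        xs.append(x)
    return xs


def count_switches(xs):
    """Number of sign changes of the state, i.e. switches between percepts."""
    return sum(1 for p, q in zip(xs, xs[1:]) if (p > 0) != (q > 0))


xs = simulate()
print(count_switches(xs))
```

As `a` grows, the well occupied by the current percept becomes shallower and eventually stops being a local minimum, so the state falls into the other well; the saturation then reverses direction and the cycle repeats.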
3. Articulatory processes underlying some verbal transformation effect properties
The potential role of the perceptuo-motor link in the VTE has been explored in various kinds of paradigms, and this has led to the emergence of two possible roles for motor processes: increasing the probability of transformations, and selecting some transformations rather than others.
(a) Motor involvement favours transformations
In a series of experiments, Reisberg et al. demonstrated the role of articulatory processes in the VTE. The authors asked participants to mentally repeat a word and report any verbal transformations they perceived. Other participants were asked to repeat the same word aloud and report any transformations from the repeated word. Different degrees of subvocal involvement were tested (whispering, silent mouthing and mental repetition without mouthing). The authors also examined some conditions where subvocal involvement was impeded, by having participants chew gum or do a concurrent articulatory task, or by clamping their articulators. The results showed that the probability of a verbal transformation increased with increasing degree of both auditory feedback and articulatory involvement: verbal transformations were more frequent with overt repetition than with whispering, silent mouthing or purely mental repetition. Moreover, when subvocal involvement was impeded, the VTE disappeared. These results suggest that articulatory processes play a role in the VTE.
(b) Articulatory processes influence the selection of preferred perceptual states
In two series of studies, Sato et al. [20,21] explored whether articulatory mechanisms could influence the preference for some percepts over others in the VTE. Their hypothesis was that some sequences of articulatory gestures, which are considered as more stable than others in speech production, should also be more often produced in verbal transformations. A typical example is the syllabification process, according to which some sequences like CV (a consonant C followed by a vowel V) are more frequent in human languages than others like VC (see e.g. [22,23]). Various perceptual or motor explanations of this preference have been proposed and tested (see e.g. ). In the ‘articulatory phonology’ framework for analysing speech production in terms of articulatory gestures coordinated in time into temporally overlapping structures [25,26], it is proposed that the CV sequence is favoured because the consonantal and vocalic gestures can be coupled in-phase, i.e. triggered synchronously. In contrast, a VC sequence would be triggered out-of-phase, with the vocalic gesture produced first, and the consonantal gesture produced only when the V is completed. The conclusion from articulatory phonology is that the CV sequence is more stable in production thanks to the in-phase coupling of C and V. Actually, various experiments have shown that, when asked to speak rapid sequences of CV or VC sounds, the latter tend to be produced as CV sounds, whereas the reverse does not occur (e.g. ). Interestingly, in the VTE, transformations occur much more often toward syllables beginning with a C, as in CV, than toward those beginning with a V, as in VC (e.g. ). A possible interpretation is that the preferred transformation involves more stable articulatory sequences, stability being defined with reference to in-phase coupling of the articulators.
Sato et al. tested whether this tendency could be observed, in a more general way, for speech sequences other than CV and VC. For example, in the classic ‘life life life’ sequence, the two Cs, ‘f’ and ‘l’, can be produced almost in synchrony in ‘fly’, while the two consonantal gestures are desynchronized in ‘life’ (being separated by the V in the middle). The same reasoning as previously leads to the prediction that the sequence with more in-phase coupling between the two consonantal gestures, i.e. ‘fly’, should be more stable than the less synchronized ‘life’. Thus, the authors hypothesized that sequences displaying larger inter-articulatory synchrony in speech production should be preferred states in the VTE. To test the hypothesis, they asked French participants to repeat monosyllabic sequences, such as repeated /ps/ or /sp/, in an overt (i.e. saying aloud) or covert (i.e. saying mentally) mode and to report verbal transformations (the non-words /ps/ and /sp/ were used rather than ‘life’ and ‘fly’ to avoid any lexical effect). The repetition of /ps/ can lead to /sp/ and vice versa (other transformations are also possible, see ). Previously, the authors had shown that speeded production of both /ps/ and /sp/ sequences displayed a significant trend towards /ps/. The results showed a significantly larger number of transformations from /sp/ to /ps/ than from /ps/ to /sp/, both in the overt and covert modes. This was explained by the fact that /ps/, for which the consonantal gestures of /p/ and /s/ are produced almost in synchrony, would involve more articulatory stability than /sp/. More specifically, a verbal transformation from /spspsp … / to /pspsps … / involves a resynchronization of /p/ and /s/ gestures, which leads to a more stable speech sequence. The articulatory synchrony assumption could, therefore, explain why /ps/ was a more stable form in this experiment (figure 1a).
The next step involved gathering evidence for the role of articulatory stability in the VTE in a purely auditory mode (with no explicit speech production involvement). This was the focus of the study by Sato et al., using C1VC2V-type stimuli where C1 and C2 were /p/ and /t/ and V was /a/, /i/ or /o/ (e.g. /pata/, /topo/ and /piti/). In this study, participants were asked to listen to repetition of a C1VC2V sequence and to report any verbal transformations they perceived. Most of the reported transformations included the veridical sequence C1VC2V and its ‘reversible’ form C2VC1V (for example, transformation from /pata/ to /tapa/ and vice versa). However, there is a trend in human languages that labial–coronal C1VC2V sequences such as ‘pata’ and ‘bada’ (/p/ and /b/ being labial Cs and /t/ and /d/ coronal ones) are more frequent than coronal–labial sequences such as ‘tapa’ and ‘daba’ [23,29]. This is known as the labial–coronal effect. The same kind of preference exists in infants' first words. This tendency might have an articulatory origin. It has been suggested that the /pata/ sequence is more stable than the /tapa/ sequence since it can be produced on a single jaw cycle by anticipating the coronal gesture /t/ while producing the labial one /p/, but the reverse pattern is not possible (figure 1b). As a matter of fact, since the lips are in front of the tongue tip in the vocal tract, anticipating the lip closure in /tVpV/ sequences would result in closing the vocal tract in front of the tongue tip and hence hiding the coronal C /t/ at the beginning of the /tVpV/ sequence, making it inaudible. Once again, a speeded production task confirmed that ‘pata’ sequences are more stable than ‘tapa’ sequences. The assumption of Sato et al. was that the underlying jaw cycles would lead listeners to segment / … (pa)tapatapatapa … / sequences into ‘jaw-compatible’ /pata/ chunks rather than into the reverse /tapa/ utterances. Hence, the /pata/ state would be preferred in the VTE. 
This was actually the case: /pata/ was heard significantly longer than /tapa/ and globally /pVtV/ percepts were more stable than /tVpV/ percepts (figure 1b; notice that other interpretations of the preference for /pata/, based on psycholinguistic arguments such as lexical status and word frequency of repeated sequences and phonotactic probability (i.e. the probability of occurrence of a particular phonetic segment in a given word position) were discounted on the basis of controls carried out in the original study).
In summary, the results reviewed in this section suggest that articulatory processes play a significant role in the VTE, both by controlling the emergence of new states, and by orienting perception towards some states rather than others, preferred states being associated with more ‘stable’ articulatory sequences, either because of increased inter-articulator synchrony, or because of the role of underlying jaw cycles chunking sequences into coherent pieces of information.
4. Multimodal nature of verbal transformations
Speech perception is not merely auditory but multi-modal. Several studies have shown that visual information from the speaker's face can increase the intelligibility of speech sounds (e.g. ). The McGurk effect demonstrates the role of audiovisual interactions in speech perception (for example, auditory /ba/ dubbed on visual /ga/ can lead to the perception of /da/) . Models of audiovisual speech perception generally consider that there is a preliminary stage of independent processing of the auditory and visual inputs, before fusion takes place at some level (e.g. [32,33]). However, recent data suggest that interaction between the auditory and visual flows could happen at a very early stage of auditory processing (e.g. ) and that the visual speech flow could modulate ongoing auditory feature processing at various levels . This leads to the idea that the perceptual organization of speech should be conceived as audiovisual rather than purely auditory, consistent with the suggestion of Remez et al.  (see §1). From this point of view, the VTE could be a useful paradigm for assessing how audition and vision are bound together in the perceptual organization of audiovisual speech.
(a) Visual speech influences the verbal transformation effect
In a series of studies, Sato et al.  examined whether visual information from the speaker's face influences the VTE. The participants were asked to listen to and/or watch the stimuli and report any changes in the repeated utterance they perceived. In a first experiment, the stimuli were repetitions of /ps/ and /sp/ sequences. They were presented in four conditions: audio (A) only, video (V) only, congruent audiovisual (AV) and incongruent audiovisual (AVi). In the AVi condition, the /ps/ audio track was dubbed onto the /sp/ video track or vice versa. All transformations were classified as /ps/, /sp/ and ‘other’. The global stability duration of each reported form was calculated by summing the durations spent perceiving that form. The results showed that the global stability duration of the percept that was consistent with the auditory track was lower for the AVi condition than for the AV condition. For example, the global stability duration of /ps/ was larger when the audio and the video tracks were /ps/ (AV condition) than when the audio track was /ps/ and the video track was /sp/ (AVi condition). In a second experiment, the authors used video tracks changing over time from AV to AVi, and from AVi to AV, while keeping the audio track constant. The results showed a larger effect of the visual input than in the first experiment. Moreover, there was a high temporal synchrony between reported transformations and changes in the video track. The visual changes, from /ps/ to /sp/ and vice versa, appeared to precisely control the verbal transformations. Since the visual material was essentially characterized by a salient visual lip-opening gesture for the /p/, the authors suggested that the visual driving of the VTE was controlled by this visual onset event.
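The global stability duration described above (the summed time spent perceiving each form) can be computed from a list of timed reports. The following is a minimal sketch under stated assumptions: reports are given as hypothetical `(onset_in_seconds, label)` pairs sorted by onset, the first report is taken to mark the initial percept, and each percept is assumed to last until the next report or the end of the stimulus. The function name and data layout are illustrative and are not taken from the original study.

```python
from collections import defaultdict


def global_stability_durations(reports, total_duration):
    """Sum, for each reported form, the time spent perceiving that form.

    reports: list of (onset, label) pairs sorted by onset (assumed layout).
    total_duration: duration of the whole repeated sequence, in the same unit.
    """
    durations = defaultdict(float)
    # Pair each report with the onset of the next one (or the end of the stimulus).
    ends = [onset for onset, _ in reports[1:]] + [total_duration]
    for (onset, label), end in zip(reports, ends):
        durations[label] += end - onset
    return dict(durations)


# Hypothetical report sequence for a 60 s trial.
reports = [(0.0, "ps"), (12.0, "sp"), (20.0, "ps"), (45.0, "other"), (50.0, "sp")]
result = global_stability_durations(reports, total_duration=60.0)
print(result)
```

With this layout, comparing the value obtained for the form matching the audio track across the AV and AVi conditions corresponds to the comparison reported in the first experiment.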
(b) Vision may influence chunking of auditory stimuli
The potential role of visual onset on the VTE was tested in a third experiment using /pata/ and /tapa/ sequences which contain two salient visual onsets, one for /pa/ and one for /ta/. These stimuli may switch from the perception of ‘pata’ to the perception of ‘tapa’, though with a bias towards ‘pata’ as discussed in §3b. They were presented in four conditions. In an audio only condition (A), just the repeated acoustic sequences /pata/ or /tapa/ were presented. In the AV condition, both the audio and the video input were presented. The other two conditions, called AV-pa and AV-ta, were prepared from the audiovisual coherent material. In these conditions, the video component was edited to remove all the information about either the /ta/ or the /pa/ syllables. These syllables were replaced by /a/ images in order to provide subjects with only visible /pa/ gestures in the AV-pa condition, and visible /ta/ gestures in the AV-ta condition, in contrast to the AV condition, where both /pa/ and /ta/ gestures were provided. In other words, in the AV-pa condition, the acoustic stimulus /pata/ was dubbed on a video stimulus /paaa/, or the acoustic stimulus /tapa/ was dubbed on a video stimulus /aapa/. In the AV-ta condition, the acoustic stimulus /pata/ was dubbed on a video stimulus /aata/, or the acoustic stimulus /tapa/ was dubbed on a video stimulus /taaa/. Participants were asked to report any transformations they perceived. The prediction was that the visible-opening gesture (/pa/ for AV-pa or /ta/ for AV-ta) would stabilize the percept beginning with it, leading to a preference for the /pata/ percept in the AV-pa condition, and for the /tapa/ percept in the AV-ta condition. As predicted, seeing the lip-opening gestures significantly drove the perception towards the sequences characterized by the visual onset: in the AV-pa condition, listeners perceived /pata/ more often than /tapa/, regardless of the repeated auditory sequence. 
An inverse pattern was observed for the AV-ta condition. This result suggests that the lip-opening gestures provide listeners with onset cues that are used in audiovisual speech segmentation.
The next question addressed was whether the visual onset cue could be provided by any visible information, even non-speech, or if it was specific to seeing the articulatory gestures through lip movements. Some hints that the effect might be speech-specific are available from previous studies showing that the audibility of speech sounds embedded in noise is improved by seeing coherent lip movements, but that the enhancement is decreased or eliminated if lip movements are replaced by bars going up and down in synchrony with the original lip movements [37,38]. Therefore, in an original experiment reported next, we tested whether the visual onset effect observed by Sato et al.  would occur when the lip movements of /paaa/ and /taaa/ were replaced by a vertical bar varying in height. Two a priori hypotheses were tested. If the visual onset effect is not speech specific, then it should occur for bars as well as for lips, and may even be stronger with bars than with a moving face in which the information is much less focused on the adequate visual information. In contrast, if the visual onset effect is speech-specific, it should not occur with bars, or at least it should decrease, if the bars can be interpreted as suggesting lip movement associated with the speech stream. Testing these hypotheses thus required assessment of whether the effect was greater with lips or with bars.
Two sets of conditions were contrasted: visual onset provided by lip opening or by bar movement. Audio stimuli /pata/ and /tapa/ were dubbed onto four different videos. Two of them contained real lip movements of /paaa/ and /taaa/. For the two other videos, the lip movements of /paaa/ and /taaa/ were replaced by a vertical red bar moving in synchrony with the audio sequences. The time course of lip movements was simulated by four images: the minimum lip opening (/p/ and /t/ closure) was replaced by minimum bar height (identical for /p/ and /t/). Bar height increased linearly to its maximum, replacing maximum lip opening for /a/, remained stable for 280 ms, and decreased linearly before the next repetition of the audio sequence. In this way, the temporal audiovisual coherence of the stimuli was preserved while changing the nature of the visual information. Notice that while the minimum lip opening is typically different for /p/ and /t/, the bar dynamics were kept exactly the same for /p/ and /t/. Hence, the bar should provide only timing information, and lead to symmetric effects for AV-pa and AV-ta conditions. Five conditions were presented for each audio sequence: audio-only (A), AV-pa with lip movement (lip-pa), AV-ta with lip movement (lip-ta), AV-pa with bar (bar-pa) and AV-ta with bar (bar-ta). Seventeen native French speakers with no reported hearing problems and with normal or corrected to normal vision participated. They were asked to listen to and watch the stimuli and report any changes in the repeated utterance they perceived.
All transformations were classified as /pata/, /tapa/ and ‘other’. The mean global stability durations (see §4a) of /pata/, /tapa/ and ‘other’ reported percepts for the five conditions and for the two audio sequences are shown in figure 2. As expected, the majority of transformations were /pata/ and /tapa/. The analysis was performed on the difference between the global stability durations of /pata/ and /tapa/, expressed as the percentage of the total stimulus duration. We call this difference ‘delta score’. This measure reflects the relative global stability duration of /pata/ and /tapa/ percepts. Using this measure enabled us to test the effect of the presentation conditions on the stability of these two percepts. An ANOVA on delta scores showed a significant effect of condition (F(4,64) = 3.86, p < 0.01) and of audio sequence (F(1,16) = 6.20, p < 0.05), but no interaction (F(4,64) = 0.54). Post hoc Newman–Keuls analysis showed a significantly larger delta score for lip-pa and bar-pa than for A, lip-ta and bar-ta conditions. The fact that the lip-pa condition led to significantly increased stability of the /pata/ percept (larger delta score for lip-pa than for A), while the lip-ta effect was not significant, is not surprising and is consistent with our previous results, since the visual movement for /pa/ is much larger (therefore, more visible) than for /ta/. It is more surprising that the same difference occurred with bars, since in this case, the movement amplitude was kept exactly the same for bar-pa and bar-ta stimuli. This suggests that the ‘bar’ effect may be partly driven by an underlying interpretation of bars as lips, with implicit knowledge of visual differences between /pa/ and /ta/. In a further analysis, we assessed whether there was any difference in the capacity of lip movements and bars to drive percepts towards /pata/ (while seeing /paaa/) or /tapa/ (while seeing /taaa/). 
To do this, we calculated the difference between delta scores for lip-pa and lip-ta (lip-index) and bar-pa and bar-ta (bar-index) separately for each participant. A paired t-test showed a significantly lower value for bar-index than lip-index (t(16) = 2.25, p < 0.05). Bar-index values were correlated with lip-index values (r = 0.55, p < 0.01).
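The measures used in this analysis can be written out explicitly. The sketch below assumes that global stability durations and the total stimulus duration are expressed in the same unit; the function names and the per-participant values are hypothetical and serve only to show the arithmetic, not to reproduce the reported data.

```python
import math
from statistics import mean, stdev


def delta_score(dur_pata, dur_tapa, total_duration):
    """Difference in global stability duration, as a percentage of stimulus duration."""
    return 100.0 * (dur_pata - dur_tapa) / total_duration


def onset_index(delta_pa, delta_ta):
    """Lip-index or bar-index: delta score with visible /pa/ minus delta score with visible /ta/."""
    return delta_pa - delta_ta


def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical indices for three participants (illustration only).
lip_index = [30.0, 22.0, 41.0]
bar_index = [29.0, 20.0, 38.0]
t, df = paired_t(lip_index, bar_index)
print(delta_score(40.0, 10.0, 60.0), t, df)
```

A positive delta score indicates that /pata/ was held longer than /tapa/, and a larger index indicates that the visible onset shifted the balance between the two percepts more strongly.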
In summary, the movement of a vertical bar simulating lip-opening gestures can provide visual onset cues, but it is somewhat less effective than lip movements in driving percepts. This is in agreement with the hypothesis that the effect is driven by the perception of the visible onset cue as a speech gesture, though further experiments are certainly necessary to confirm this conclusion. The asymmetry of the effect between bar-pa and bar-ta conditions, similar to the asymmetry between lip-pa and lip-ta, is consistent with this assumption. The fact that the effects of lips and bar were correlated suggests that there is probably a single effect with lips and bars (though in a somewhat weaker way in the second case) rather than two different effects, one speech-specific and the other psychophysical.
The experiments reviewed in this section clearly show that multistable speech perception is audiovisual. The visual onset effect provides an additional input to the ‘articulatory chunking’ mechanism reviewed in §3. Possible chunks are determined by articulatory principles including inter-articulator synchrony and jaw opening–closing cycles. The visual modality may participate in this ‘chunking’ process through a visual onset mechanism driving the percept towards the item beginning with the visible onset.
5. Involvement of the ‘dorsal stream’ in the verbal transformation effect
Recent brain imaging and transcranial magnetic stimulation studies suggest a functional link between speech perception and production. Indeed, brain areas involved in the planning and execution of speech gestures (i.e. the left inferior frontal gyrus, the ventral premotor and primary motor cortices) have been repeatedly found to be activated during auditory, visual and/or auditory–visual speech perception [39–50]. In addition, perceptual performance on auditory syllable decision tasks is affected by temporarily disrupting the activity of the speech motor centres, thus suggesting a mediating role of the motor system in speech perception, especially under noisy conditions [51–54].
In recent neurobiological models of speech perception, the motor activity has been attributed to a so-called temporo-parieto-frontal ‘dorsal stream’, which is supposed to provide a mechanism for the development and maintenance of correspondence between sound-based representations in the superior temporal gyrus and articulatory-based representations in the inferior frontal gyrus/premotor cortex, via sensorimotor recoding in the posterior part of the superior temporal gyrus and/or the inferior parietal lobule [55–57]. A few studies have investigated neural correlates of the VTE in the human brain, in relation to this dorsal stream, considering that this type of study could provide information about the potential role of articulatory processes in the VTE.
(a) Neuroanatomy of the verbal transformation effect
In a functional MRI (fMRI) study using a block design, Sato et al.  compared two conditions, a verbal transformation condition involving the mental repetition of speech sequences (e.g. /ps/-/ps/ and /sp/-/ps/) while actively searching for verbal transformations, and a baseline condition involving simple repetition of the same items. In the verbal transformation condition, subjects were instructed to change what they mentally repeated when they perceived a verbal transformation. For example, for the /sp/ sequence in the verbal transformation condition, subjects mentally repeated /sp/ until the perceptual emergence of /ps/, then they continued mentally repeating /ps/ while waiting for the emergence of the percept /sp/, and so on. In contrast, for the /sp/ sequence in the baseline condition they just had to continue mentally repeating /sp/ all along the epoch. There was left-hemisphere activity related to the verbal transformation task (VTE condition–baseline) within the inferior frontal, the supramarginal and the superior temporal gyri and the anterior part of the insular cortex. Activity was also observed within the right anterior cingulate cortex and the cerebellum bilaterally. The authors suggested that this temporo-parieto-frontal network performs online analysis of the repeated speech sequence and the temporary storage of the resulting representations. In addition, the authors proposed that the emergence of new representations may involve syllable parsing in the left inferior frontal gyrus and competition between representations in the anterior cingulate cortex.
Using purely auditory tasks and event-related fMRI, Kondo & Kashino  contrasted a verbal transformation condition and a tone detection condition (the speech sequence used in the verbal transformation condition was /banana/). In the verbal transformation condition, participants were asked to report their perceptual changes by pressing a button. The same stimulus was used in the tone detection condition, except that tone pips were presented in the background and participants were asked to press a button whenever a tone pip occurred. The tone detection condition was an ‘emulation’ of the verbal transformation condition, i.e. for each participant, the tone pips were timed to follow the time course of switches reported in the previous verbal transformation condition. Thus, the number of responses matched as closely as possible that for the verbal transformation condition and the motor responses were nearly identical in both conditions (for the same kind of procedure, see e.g. ). For both conditions, bilateral activation was observed within the primary auditory cortex, and the posterior part of the superior temporal and the supramarginal gyri. Additional activity was observed in the left insular cortex. However, the anterior cingulate cortex, the prefrontal cortex and the left inferior frontal gyrus were activated only for the verbal transformation task. The authors found a positive correlation between the number of transformations and the intensity of the observed signal in the left inferior frontal gyrus. According to the authors, this finding may reflect the role, in the generation of verbal transformations, of predictive processes that are based on articulatory gestures and updated in the left inferior frontal cortex (an area involved in speech production). Conversely, a negative correlation was observed between the activity of the anterior part of the left cingulate cortex and the number of transformations.
The authors noted the role of the dorsal part of the cingulate cortex in the stabilization of percepts and in error detection. They proposed that this area might play a role in matching between possible verbal forms and auditory input in the VTE.
Altogether, a common network emerges from these two studies, basically corresponding to the organization of the dorsal stream. However, no temporal information is provided by fMRI studies about when the brain areas involved in the VTE are activated. The objective of the experiment described in the next section was to provide such information.
(b) Neurophysiological correlates of the decision process in the verbal transformation effect
Basirat et al.  investigated the temporal dynamics of verbal transformations using intracranial EEG (iEEG) recordings from electrodes implanted in the brains of two patients with epilepsy as part of their presurgical evaluation. This method provides high temporal and spatial resolution. Two experimental conditions were used: the verbal transformation condition (ENDO, for endogenously driven perceptual switch) and an auditory change condition (EXO, for exogenously driven perceptual switch). In the ENDO condition, the participants were asked to listen to the auditory stimulus /patapata … / or /tapatapa … / and to press a button whenever they perceived any change in the stimulus. In the EXO condition, real random changes between /pa/ and /ta/ were presented (i.e. / … papa … tata … papa … /), and participants were again asked to report any change in percept by pressing a button. A time-frequency analysis of iEEG responses showed an increase of gamma band activity (above 40 Hz) in the left inferior frontal and supramarginal gyri 300–800 ms before the button press in the ENDO condition. In the EXO condition, gamma band activity was found 200 ms prior to the button press, mainly in the left superior temporal and supramarginal gyri. The authors reasoned that gamma band increases in the ENDO condition could not be due to the preparation for the motor response per se, since these activities were observed much earlier than in the EXO condition. This result supports a specific role of the observed parieto-frontal network in perceptual switches and decision-making.
Despite different imaging methods and different materials and tasks used in the three studies reviewed above, quite similar areas were observed in relation to verbal transformations. These areas, including the left inferior frontal, left supramarginal and left superior temporal gyri, are consistent with the dorsal stream of speech processing. Altogether, the neuroimaging results on the VTE reviewed above support the idea that the speech motor system is activated in speech perception, in agreement with a possible role of articulatory-based representations in the emergence and the stabilization of speech percepts.
We now return to the questions raised at the beginning of this paper, to summarize what has been learned about the perceptual organization of speech through the use of the VTE.
(a) Perceptuo-motor speech ‘chunks’
The question of possible ‘natural’ units in speech perception is old and complex. Syllables have been repeatedly considered as possible candidates, although their role remains controversial [62–64]. Auditory processing provides some bases for enhancing elements of syllable structures, through the processing of modulations (see e.g. [65,66]) or the detection of auditory events (e.g. ). The data presented in this paper suggest that ‘chunks’ could emerge from a variety of cues, including articulatory and visual ones. Section 3 showed how ‘articulatory chunking’ principles related to inter-articulatory synchrony or underlying jaw cycles could provide a ‘glue’ for sticking together pieces of acoustic information. Section 4 pointed towards the potential role of visual information and particularly of speech visual onsets. Overall, we suggest that the perceptual organization of the speech stream could involve perceptuo-motor chunks formed on the basis of articulatory, auditory and visual information. These chunks would serve as a basis for further processing by higher-level mechanisms involved in comprehension. This would extend classical facts about speech perception, namely the pivotal role of syllables in the auditory processing of speech (e.g. ), the role of syllable onsets in speech segmentation  and the importance of word onsets in lexical access and speech comprehension (e.g. ). It could provide a basis for future experiments on the role of visual speech in lexical access (see ), particularly concerning audiovisual detection of word onsets.
Overall, the behavioural and neurophysiological data presented in this paper suggest that the perceptual organization of speech is based on perceptuo-motor and multimodal processes. This fits well with the perception-for-action-control theory (PACT) developed by Schwartz et al. , which is based on the assumption of two roles for perceptuo-motor links (possibly implemented in the human brain within the ‘dorsal stream’) during speech perception. The first involves co-structuring of perceptual and motor representations, enabling auditory categorization mechanisms to take motor information into account (dotted line in figure 3). The second (solid lines in figure 3) is related to speech analysis per se: it represents the contribution provided by motor knowledge to the speech scene analysis process, as demonstrated, among other things, by the present data on the VTE. The ‘articulatory chunking’ phenomenon is related to this second role of perceptuo-motor processes during speech perception.
Coming back to the issues raised by Bregman  and Remez et al.  about the principles of speech scene analysis, should these perceptuo-motor chunks be considered as primitives or schemas? If we consider that one main difference between primitive and schema-based mechanisms is that the latter need learning, articulatory chunking (which is based on motor knowledge) should be considered as a schema-based process. However, the mechanism based on articulatory stability could be quite general, and possibly learned very early, and perhaps even present in a speech communication module genetically specified in the human brain and available in pre-linguistic infants. In their review of the perception of speech, Pardo and Remez  claim that the perceptual organization of speech does not require learning and quote evidence from studies of pre-linguistic (14-week-old) infants, who are able to integrate acoustic elements of speech even when they are spectrally and spatially disparate. Further experiments on the potential existence of articulatory stability and articulatory chunking in infants should shed some light on this question (see e.g. ).
(b) The verbal transformation effect in relation to a general framework for multistability
We now attempt to connect the tentative portrait of the VTE that we propose in this paper with a global framework for multistability, as it emerges from the literature and particularly from the present special issue. The framework we consider is based on the ‘predictive coding’ approach, according to which perception involves active prediction (or synthesis) of sensory input. Prediction is compared against the input signal; an error signal is then fed back to the perceptual system, which enables the re-evaluation of the sensory input. Predictive coding has become an influential hypothesis about how the brain deals with sensory information coming from the external world, which is usually noisy and ambiguous (for a review, see ).
Predictive coding has a special appeal in speech perception, since it provides general principles supporting a model introduced long ago in the literature, the so-called ‘analysis-by-synthesis’ model of speech perception . Analysis-by-synthesis is based on the assumption that speech perception involves analysing the sound by assessing the possible articulatory commands that generated it. Predictive coding has been considered as a possible model for explaining neurophysiological data on the influence of vision on auditory processing in speech perception : the natural dynamics of audiovisual speech as well as phonological knowledge would allow the speech-processing system to build an online prediction of auditory signals . This prediction system would involve a motor-based ‘analysis-by-synthesis’ mechanism associated with a temporo-parieto-frontal network in the brain . In a more general way, PACT has extended this concept to speech scene analysis (see solid lines in figure 3) .
The neuroimaging findings on the VTE reviewed in §5 seem to fit well with this framework. The enhanced parietal and frontal activity before verbal transformations observed by Basirat et al.  may contribute to hypothesis generation and error minimization mechanisms: a verbal transformation happens when the prediction error becomes large, so that a new hypothesis (with a smaller prediction error) is considered as a better choice. A positive correlation between the number of transformations and the intensity of the observed signal in the left inferior frontal gyrus reported by Kondo & Kashino  is also consistent with this interpretation. We cannot determine at this point whether enhanced frontal and parietal activity reflects prediction or a prediction-error signal. The interpretation of these data based on predictive coding does not exclude a contribution of stimulus-driven mechanisms such as adaptation (see also ). One possibility is that the prediction-error signal in the VTE is modulated by neural fatigue (in low-level brain areas) and then sent to high-level areas for generation of a better prediction, based on articulatory representations. This suggestion is speculative, considering the small amount of neuroimaging data on the VTE. Further studies on the interaction of general ‘low-level’ mechanisms in speech perception (including e.g. onset detection, extraction of spectro-temporal primitives and sensory adaptation) and perceptuo-motor links in the VTE would be helpful to investigate this.
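The speculative scenario above (a percept is abandoned once its prediction error, inflated by adaptation, clearly exceeds that of a competing hypothesis) can be made concrete with a toy simulation. The sketch below is purely illustrative: it is not a model proposed in the studies reviewed here and is not fitted to any data. The function name `simulate_switches`, its parameters and all numerical values are assumptions chosen for readability, and integer 'error units' are used only to keep the arithmetic exact.

```python
def simulate_switches(n_repetitions=20, base_error=(2, 5), fatigue_step=1,
                      recovery_step=1, switch_margin=3):
    """Toy two-hypothesis model of verbal transformations (illustrative only).

    Each hypothesis has a fixed mismatch with the input (base_error) plus an
    adaptation term that grows while the hypothesis is the current percept and
    decays while it is suppressed. Returns the repetition indices at which the
    percept switches.
    """
    fatigue = [0, 0]  # adaptation accumulated by each hypothesis (assumed units)
    # Start with the hypothesis that initially has the smaller prediction error.
    current = 0 if base_error[0] <= base_error[1] else 1
    switches = []
    for step in range(n_repetitions):
        # Prediction error = fixed mismatch + adaptation-driven increase.
        errors = [base_error[i] + fatigue[i] for i in range(2)]
        other = 1 - current
        # Switch when the current percept is clearly the worse explanation.
        if errors[current] - errors[other] >= switch_margin:
            current, other = other, current
            switches.append(step)
        # The dominant hypothesis fatigues; the suppressed one recovers.
        fatigue[current] += fatigue_step
        fatigue[other] = max(0, fatigue[other] - recovery_step)
    return switches


if __name__ == "__main__":
    print(simulate_switches())
```

With these arbitrary defaults the simulated percept alternates at regular intervals. The only point of the sketch is that a slowly growing, adaptation-modulated error term combined with a comparison between hypotheses is sufficient to produce repeated switches under a constant input.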
It has been suggested that similar principles govern auditory and visual multistability . Predictive coding is invoked by several authors in this issue to explain how various perceptual configurations could compete in the human brain and how the dynamics of perceptual transitions could be predicted on the basis of this kind of model [80–83]. The VTE could then be conceived as a ‘speech-specific’ form of this general scenario. Although the idea that the same kind of mechanisms may be involved in the VTE and in perceptual multistability in other modalities may seem speculative, neuroimaging findings provide some hints. Several recent studies observed the activation of parietal and frontal areas by visual multistable stimuli (for a review, see ). In an influential paper on perceptual multistability, Leopold & Logothetis  suggested that these sensorimotor areas are involved in re-evaluating the current perceptual interpretation, in both normal and multistable vision. Their intervention becomes noticeable when there is ambiguity in the visual input, that is, when several relevant interpretations are possible. The authors suggest that this re-evaluation mechanism might be based on an iterative and random system of ‘checks and balances’: the role of parieto-frontal areas would be to periodically force perception to reorganize, which may lead to perceptual switches during multistability tasks. Although the data considered by the authors all concerned visual multistability, the data reviewed in §5 on the role of the dorsal stream in the VTE are consistent with this mechanism. By this reasoning, the perceptuo-motor links in speech analysis and perception form a speech-specific component of a more general mechanism enabling the control of meaningful perception of sensory input in parieto-frontal regions.
This general mechanism, in which the re-evaluation of a scene by parieto-frontal areas is initiated by a signal received from sensory areas, is compatible with ‘analysis-by-synthesis’ for speech perception (and predictive coding ideas in general).
We return to the last question raised at the beginning of this paper: what is the nature of speech perceptual objects? This question is a kind of ‘Holy Grail’ for speech communication researchers. Kubovy & Van Valkenburg  suggested that ‘a perceptual object is that which is susceptible to figure-ground segregation’: principles of perceptual grouping produce possible perceptual objects, and attention selects one or a set of these objects to become figure and assigns all other information from the sensory scene to ground. It is these figures that represent perceptual objects. In their view (see also ), visual and auditory objects are formed in a space–time and a pitch–time space, respectively. What, then, are the ‘metrics’ for defining speech objects? Although the answer to this question is far beyond the scope of this paper, we suggest that motor principles might contribute to speech object formation. Whatever the view on this, the VTE provides an interesting tool for studying the nature of speech objects.
We thank Brian Moore and Mark Pitt for their comments, suggestions and corrections of previous versions of this manuscript. This work was supported by the Centre National de la Recherche Scientifique (CNRS) and the Agence Nationale de la Recherche (ANR-08-BLAN-0167-01, project Multistap).
One contribution of 10 to a Theme Issue ‘Multistability in perception: binding sensory modalities’.
1 In this second analysis, we focused on the duration of /pata/ and /tapa/, without including the duration of ‘other’ transformations. The delta score was, therefore, calculated as the difference between the global stability durations of /pata/ and /tapa/ normalized by the sum of the global stability durations of /pata/ and /tapa/ (instead of normalizing by the total duration of all transformations as in the first analysis).
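The delta score described in this footnote can be written as a short function. This is a minimal sketch under stated assumptions: the function name `delta_score`, its argument names and the example durations are hypothetical, and the sign convention (/pata/ minus /tapa/) is assumed from the order in which the two percepts are named in the footnote.

```python
def delta_score(duration_pata: float, duration_tapa: float) -> float:
    """Normalized difference between two global stability durations.

    Implements (D_pata - D_tapa) / (D_pata + D_tapa), i.e. the second analysis
    described in the footnote, where 'other' transformations are excluded from
    the normalization. Argument names and units are illustrative assumptions.
    """
    total = duration_pata + duration_tapa
    if total <= 0:
        raise ValueError("the summed stability duration must be positive")
    return (duration_pata - duration_tapa) / total


if __name__ == "__main__":
    # Hypothetical durations in seconds, chosen only to illustrate the formula.
    print(delta_score(30.0, 10.0))
```

By construction the score lies between -1 and 1, is positive when /pata/ is the more stable percept, negative when /tapa/ is, and zero when both are equally stable.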
This journal is © 2012 The Royal Society