When people speak with one another, they tend to adapt their head movements and facial expressions in response to each other's head movements and facial expressions. We present an experiment in which confederates' head movements and facial expressions were motion tracked during videoconference conversations, an avatar face was reconstructed in real time, and naive participants spoke with the avatar face. No naive participant guessed that the computer-generated face was not video. Confederates' facial expressions, vocal inflections and head movements were attenuated at 1 min intervals in a fully crossed experimental design. Attenuated head movements led to increased head nods and lateral head turns, and attenuated facial expressions led to increased head nodding in both naive participants and confederates. Together, these results are consistent with a hypothesis that the dynamics of head movements in dyadic conversation include a shared equilibrium. Although both conversational partners were blind to the manipulation, when apparent head movement of one conversant was attenuated, both partners responded by increasing the velocity of their head movements.
When people converse, they adapt their movements, facial expressions and vocal cadence to one another. This multi-modal adaptation allows the communication of information that either reinforces or is in addition to the information that is contained in the semantic verbal stream. For instance, back-channel information such as direction of gaze, head nods and ‘uh-huh’s allow the conversants to better segment speaker–listener turn taking. Affective displays such as smiles, frowns, expressions of puzzlement or surprise, shoulder movements, head nods and gaze shifts are components of the multi-modal conversational dialogue.
When two people adopt similar poses, this could be considered a form of spatial symmetry (Boker & Rotondo 2002). Interpersonal symmetry has been reported in many contexts and across sensory modalities: for instance, patterns of speech (Cappella & Planalp 1981, Neumann & Strack 2000), facial expression (Hsee et al. 1990) and laughter (Young & Frye 1966). Increased symmetry is associated with increased rapport and affinity between conversants (LaFrance 1982; Bernieri 1988). Intrapersonal and cross-modal symmetry may also be expressed. Smile intensity is correlated with cheek raising in smiles of enjoyment (Messinger et al. 2009) and with head pitch and yaw in embarrassment (Cohn et al. 2004; Ambadar et al. 2009). The structure of intrapersonal symmetry may be complex: the self-affine multi-fractal dimension of head movements changes with conversational context (Ashenfelter et al. 2009).
Symmetry in movements implies redundancy in movements, which can be defined as negative Shannon information (Shannon & Weaver 1949; Redlich 1993). As symmetry is formed between conversants, the ability to predict the actions of one based on the actions of the other increases. When symmetry is broken by one conversant, the other is likely to be surprised or experience change in attention. The conversant's previously good predictions would now be much less accurate. Breaking symmetry may be a method for increasing the transmission of non-verbal information by reducing the redundancy in a conversation.
This view of an ever-evolving symmetry between two conversants may be conceptualized as a dynamical system with feedback as shown in figure 1. Motor activity (e.g. gestures, facial expression or speech) is produced by one conversant and perceived by the other. These perceptions contribute to some system that functions to map the perceived actions of the interlocutor onto potential action: a mirror system. Possible neurological candidates for such a mirror system have been advanced by Rizzolatti and colleagues (Iacoboni et al. 1999; Rizzolatti & Craighero 2004; Rizzolatti & Fadiga 2007), who argued that such a system is fundamental to communication.
Conversational movements are likely to be non-stationary (Boker et al. 2002; Ashenfelter et al. in press) and involve both symmetry formation and symmetry breaking (Boker & Rotondo 2002). One technique that is used in the study of non-stationary dynamical systems is to induce a known perturbation into a free running system and measure how the system adapts to the perturbation. In the case of facial expressions and head movements, one would need to manipulate conversant A's perceptions of the facial expressions and head movements of conversant B, while conversant B remained blind to these manipulations as illustrated in figure 2.
Recent advances in active appearance models (AAMs) (Cootes et al. 2002) have allowed the tracking and re-synthesis of faces in real-time (Matthews & Baker 2004). Placing two conversants into a videoconference setting provides a context in which a real-time AAM can be applied, since each conversant is facing a video camera and each conversant only sees a video image of the other person. One conversant could be tracked and the desired manipulations of head movements and facial expressions could be applied prior to re-synthesizing an avatar that would be shown to the other conversant. In this way, a perturbation could be introduced as shown in figure 2.
To test the feasibility of this paradigm and to investigate the dynamics of symmetry formation and breaking, we present the results of an experiment in which we implemented a mechanism for manipulating head movement and facial expression in real time during a face-to-face conversation using a computer-enhanced videoconference system. The experimental manipulation was not noticed by naive participants, who were informed that they would be in a videoconference and that we had ‘cut out’ the face of the person with whom they were speaking. No participant guessed that he or she was actually speaking with a synthesized avatar. This manipulation revealed the co-regulation of symmetry formation and breaking in two-person conversations.
2. Material and methods
Videoconference booths were constructed in two adjacent rooms. Each 1.5 m × 1.2 m footprint booth consisted of a 1.5 m × 1.2 m backprojection screen, two 1.2 m × 2.4 m non-ferrous side walls covered with white fabric, and a white fabric ceiling. Each participant sat on a stool approximately 1.1 m from the backprojection screen as shown in figure 3. Audio was recorded using Earthworks directional microphones through a Yamaha 01V96 multi-channel digital audio mixer. National Television System Committee format video was captured using Panasonic IK-M44H ‘lipstick’ colour video cameras and recorded to two JVC BR-DV600U digital video decks. Society of Motion Picture and Television Engineers time stamps generated by an ESE 185-U master clock were used to maintain a synchronized record on the two video recorders and to synchronize the data from a magnetic motion capture device. Head movements were tracked and recorded using an Ascension Technologies MotionStar magnetic motion tracker sampling at 81.6 Hz from a sensor attached to the back of the head using an elastic headband. Each room had an extended range transmitter whose fields overlapped through the non-ferrous wall separating the two video booth rooms.
To track and re-synthesize the avatar, video was captured by an AJA Kona card in an Apple 2-core 2.5 GHz G5 PowerMac with 3 Gb of RAM and 160 Gb of storage. The PowerMac ran software described below and output the resulting video frames to an InFocus IN34 DLP Projector. The total delay time from the camera in booth 1 through the avatar synthesis process to the projector in booth 2 was 165 ms. The total delay time from the camera in booth 2 to the projector in booth 1 was 66 ms, because the video signal was passed directly from booth 2 to booth 1 and did not need to go through video analogue/digital conversion and avatar synthesis. For the audio manipulations described below, we reduced vocal pitch inflection using a TC-Electronics VoiceOne Pro. Audio–video synchronization was maintained using digital delay lines built into the Yamaha 01V96 mixer.
(b) Active appearance models
AAMs (Cootes et al. 2001) are generative, parametric models commonly used to track and synthesize faces in video sequences. Recent improvements in both the fitting algorithms and the hardware on which they run allow tracking (Matthews & Baker 2004) and synthesis (Theobald et al. 2007) of faces in real-time.
The AAM is formed of two compact models: one describes variation in shape and the other variation in appearance. AAMs are typically constructed by first defining the topological structure of the shape (the number of landmarks and their interconnectivity to form a two-dimensional triangulated mesh), then annotating with this mesh a collection of images that exhibit the characteristic forms of the variation of interest. For this experiment, we label a subset of 40–50 images (less than 0.2% of the images in a single session) that are representative of the variability in facial expression. An individual shape is formed by concatenating the coordinates of the corresponding mesh vertices, $s = (x_1, y_1, \ldots, x_n, y_n)^T$, so the collection of training shapes can be represented in matrix form as $S = [s_1, s_2, \ldots, s_N]$. Applying principal component analysis (PCA) to these shapes, typically aligned to remove in-plane pose variation, provides a compact model of the form

$$s = s_0 + \sum_{i=1}^{m} s_i p_i \tag{2.1}$$

where $s_0$ is the mean shape and the vectors $s_i$ are the eigenvectors corresponding to the $m$ largest eigenvalues. These eigenvectors are the basis vectors that span the shape space and describe variation in the shape about the mean. The coefficients $p_i$ are the shape parameters, which define the contribution of each basis in the reconstruction of $s$. An alternative interpretation is that the shape parameters are the coordinates of $s$ in shape space, thus each coefficient is a measure of the distance from $s_0$ to $s$ along the corresponding basis vector.
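As a minimal sketch of the linear shape model described above (a shape is the mean shape plus a weighted sum of orthonormal basis vectors), the Python below reconstructs a shape from its parameters and recovers the parameters by projection. The two-landmark mesh, the basis vectors and the function names are illustrative assumptions, not the model used in the experiment.

```python
def reconstruct_shape(s0, basis, p):
    """Return s = s0 + sum_i p[i] * basis[i]."""
    s = list(s0)
    for p_i, s_i in zip(p, basis):
        for k in range(len(s)):
            s[k] += p_i * s_i[k]
    return s


def project_shape(s0, basis, s):
    """Return p_i = s_i . (s - s0); valid only for an orthonormal basis."""
    diff = [a - b for a, b in zip(s, s0)]
    return [sum(b_k * d_k for b_k, d_k in zip(s_i, diff)) for s_i in basis]


# Hypothetical toy mesh with two landmarks, stored as (x1, y1, x2, y2).
s0 = [0.0, 0.0, 1.0, 0.0]
# Two made-up orthonormal basis vectors (unit length, mutually orthogonal).
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]

shape = reconstruct_shape(s0, basis, [2.0, -1.0])
params = project_shape(s0, basis, shape)
```

Because the basis is orthonormal, projecting the reconstructed shape returns the parameters that generated it, which is the sense in which the parameters are coordinates in shape space.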
The appearance of the AAM is a description of the variation estimated from a shape-free representation of the training images. Each training image is first warped from the manually annotated mesh location to the base shape, so the appearance comprises the pixels that lie inside the base mesh, $\mathbf{x} = (x, y)^T \in s_0$. PCA is applied to these images to provide a compact model of appearance variation of the form

$$A(\mathbf{x}) = A_0(\mathbf{x}) + \sum_{i=1}^{l} \lambda_i A_i(\mathbf{x}) \quad \forall \mathbf{x} \in s_0 \tag{2.2}$$

where $A_0$ is the base appearance and the appearance images, $A_i$, are the eigenvectors corresponding to the $l$ largest eigenvalues. As with shape, the eigenvectors are the basis vectors that span appearance space and describe variation in the appearance about the mean. The coefficients $\lambda_i$ are the appearance parameters, which define the contribution of each basis in the reconstruction of $A(\mathbf{x})$. Because the model is invertible, it may be used to synthesize new face images (see figure 4).
(c) Manipulating facial displays using AAMs
To manipulate the head movement and facial expression of a person during a face-to-face conversation such that they remain blind to the manipulation, an avatar is placed in the feedback loop, as shown in figure 2. Conversants speak via a videoconference and an AAM is used to track and parameterize the face of one conversant.
As outlined, the parameters of the AAM represent displacements from the origin in the shape and appearance space. Thus, scaling the parameters has the effect of either exaggerating or attenuating the overall facial expression encoded as AAM parameters

$$s = s_0 + \sum_{i=1}^{m} s_i p_i \beta, \qquad A(\mathbf{x}) = A_0(\mathbf{x}) + \sum_{i=1}^{l} \lambda_i A_i(\mathbf{x}) \beta \tag{2.3}$$

where $\beta$ is a scalar, which when greater than unity exaggerates the expression and when less than unity attenuates the expression. An advantage of using an AAM to conduct this manipulation is that a separate scaling can be applied to the shape and appearance to create some desired effect. We stress here that in these experiments, we are not interested in manipulating individual actions on the face (e.g. inducing an eye-brow raise), rather we wish to manipulate, in real time, the overall facial expression produced by one conversant during the conversation.
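The scaling described above can be sketched in a few lines of Python. Multiplying every parameter by a scalar moves the face towards the mean (scalar below 1) or away from it (scalar above 1). The function names and the separate `beta_shape` and `beta_appearance` arguments are assumptions for illustration; the default of 0.5 mirrors the attenuation level used for the shape parameters in this experiment.

```python
def scale_parameters(params, beta):
    """Scale model parameters: beta < 1 attenuates, beta > 1 exaggerates."""
    return [beta * value for value in params]


def manipulate_expression(shape_params, appearance_params,
                          beta_shape=0.5, beta_appearance=0.5):
    """Apply separate (hypothetical) scalings to shape and appearance."""
    return (scale_parameters(shape_params, beta_shape),
            scale_parameters(appearance_params, beta_appearance))


damped_shape, damped_appearance = manipulate_expression([2.0, -1.0], [4.0])
unchanged_shape, _ = manipulate_expression([2.0, -1.0], [4.0], beta_shape=1.0)
```

Since the parameters are displacements from the mean, a scalar of 1.0 leaves the expression untouched and a scalar of 0.0 would render the mean face.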
The second conversant does not see the video of the person to whom they are speaking. Rather, they see a re-rendering of the video from the manipulated AAM parameters as shown in figure 5. To re-render the video using the AAM, the shape parameters, $p = (p_1, \ldots, p_m)^T$, are first applied to the model, equation (2.3), to generate the shape, $s$, of the AAM, followed by the appearance parameters to generate the AAM image, $A(\mathbf{x})$. Finally, a piece-wise affine warp is used to warp $A(\mathbf{x})$ from $s_0$ to $s$, and the result is transferred into image coordinates using a similarity transform (i.e. movement in the x–y plane, rotation and scale). This can be achieved efficiently, at video frame rate, using standard graphics hardware.
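The final step above, the similarity transform, combines a uniform scale, an in-plane rotation and a translation. The Python below is a minimal sketch of that mapping applied to mesh vertices. The function name, argument order and example values are assumptions, and the piece-wise affine warp itself is not shown.

```python
import math


def similarity_transform(points, scale, angle_rad, tx, ty):
    """Rotate by angle_rad, scale uniformly, then translate by (tx, ty)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty)
            for x, y in points]


# Hypothetical vertices in model coordinates, placed into image coordinates.
vertices = [(1.0, 0.0), (0.0, 1.0)]
placed = similarity_transform(vertices, scale=2.0,
                              angle_rad=math.pi / 2, tx=1.0, ty=1.0)
```

A similarity transform preserves the shape of the mesh, so all expression information stays in the model parameters and only the position, orientation and size of the face in the frame are set at this stage.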
Typical example video frames synthesized using an AAM before and after damping are shown in figure 6. Note that the effect of the damping is to reduce the expressiveness. Our interest here is to estimate the extent to which manipulating expressiveness in this way can affect the behaviour during conversation.
Naive participants (n = 27, 15 male, 12 female) were recruited from the psychology department participant pool at a midwestern university. Confederates (n = 6, three male and three female) were undergraduate research assistants. AAM models were trained for the confederates so that the confederates could act as one conversant in the dyad. Confederates were informed of the purpose of the experiment and the nature of the manipulations, but were blind to the order and timing of the manipulations. All confederates and naive participants read and signed informed consent forms approved by the Institutional Review Board.
We attenuated three variables in a fully crossed design: (i) head pitch and turn: the translation and rotation in image coordinates away from their canonical values (multiplied by either 1.0 or 0.5); (ii) facial expression: the vector distance of the AAM shape parameters from the canonical expression (by multiplying the AAM shape parameters by either 1.0 or 0.5); and (iii) audio: the range of frequency variability in the fundamental frequency of the voice (by using the VoiceOne Pro to either restrict or not restrict the range of the fundamental frequency of the voice). Naive participants were given a cover story that video was ‘cut out’ around the face and then participated in two 8 min conversations, one with a male and another with a female confederate. Prior to debrief, the naive participants were asked if they ‘noticed anything unusual about the experiment’. None mentioned that they thought they were speaking with a computer-generated face or noted the experimental manipulations.
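With three two-level factors, a fully crossed design has 2 × 2 × 2 = 8 cells, which is consistent with one cell per 1 min interval of an 8 min conversation. The Python sketch below simply enumerates the cells. The factor names are assumptions, and the order in which cells were presented is not stated in the text, so no ordering or randomization is implied here.

```python
from itertools import product

# Hypothetical factor names; 0 = normal, 1 = attenuated.
FACTORS = ("head_attenuated", "expression_attenuated", "voice_attenuated")


def fully_crossed(factors=FACTORS):
    """Return every combination of the two levels of each factor."""
    return [dict(zip(factors, levels))
            for levels in product((0, 1), repeat=len(factors))]


def scale_factor(attenuated):
    """Multiplier for head movement or shape parameters: 0.5 or 1.0."""
    return 0.5 if attenuated else 1.0


conditions = fully_crossed()
```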
(f) Data reduction and analysis
Angles of the Ascension Technologies head sensor in the anterior–posterior (A–P) and lateral directions (i.e. pitch and yaw, respectively) were selected for analysis. These directions correspond to the meaningful motion of a head nod and a head turn, respectively. We focus on angular velocity since this variable can be thought of as how animated a participant was during an interval of time.
To compute angular velocity, we first converted the head angles into angular displacement by subtracting the mean overall head angle across a whole conversation from each head angle sample. We used the overall mean head angle since this provided an estimate of the overall equilibrium head position for each participant independent of the trial conditions. Second, we low-pass filtered the angular displacement time series and calculated angular velocity using a quadratic filtering technique (generalized local linear approximation; Boker et al. in press), saving both the estimated displacement and the velocity for each sample. The root mean square (RMS) of the lateral and A–P angular velocity was then calculated for each 1 min condition of each conversation for each naive participant and confederate.
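The steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it omits the separate low-pass filter and estimates velocity as the slope of a least-squares quadratic fitted in a short sliding window, which is the idea underlying local linear approximation methods. The window width `half_window` and all function names are assumptions.

```python
import math


def angular_displacement(angles):
    """Subtract the overall mean angle (the assumed equilibrium position)."""
    mean = sum(angles) / len(angles)
    return [a - mean for a in angles]


def local_quadratic_velocity(x, dt, half_window=2):
    """Slope at the centre of each window from a quadratic least-squares fit.

    For time offsets symmetric about zero, the fitted slope reduces to
    sum(t_k * x_k) / sum(t_k ** 2).
    """
    offsets = [k * dt for k in range(-half_window, half_window + 1)]
    denom = sum(t * t for t in offsets)
    velocity = []
    for centre in range(half_window, len(x) - half_window):
        window = x[centre - half_window:centre + half_window + 1]
        velocity.append(sum(t * v for t, v in zip(offsets, window)) / denom)
    return velocity


def rms(values):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Toy data: a head angle rising at a constant 3 units per second.
angles = [0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0]
velocity = local_quadratic_velocity(angular_displacement(angles), dt=0.5)
rms_velocity = rms(velocity)
```

For the motion tracker described earlier, which sampled at 81.6 Hz, `dt` would be 1/81.6 s, and the RMS would be computed separately over the samples in each 1 min condition.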
Because the head movements of each conversant both influence and are influenced by the movements of the other, we seek an analytic strategy that models bidirectional effects (Kenny & Judd 1986). Specifically, each conversant's head movements are both a predictor variable and an outcome variable. Neither can be considered to be an independent variable. In addition, each naive participant was engaged in two conversations, one with each of the two confederates. Each of these sources of non-independence in dyadic data needs to be accounted for in a statistical analysis.
To put both conversants in a dyad into the same analysis we used a variant of Actor–Partner analysis (Kashy & Kenny 2000; Kenny et al. 2006). Suppose we are analysing RMS-V angular velocity. We place both the naive participants' and the confederates' RMS-V angular velocity into the same column in the data matrix and use a second column as a dummy code labelled ‘confederate’ to identify whether the data in the angular velocity column came from a naive participant or a confederate. In a third column, we place the RMS-V angular velocity from the other participant in the conversation. We then use the terminology ‘actor’ and ‘partner’ to distinguish which variable is the predictor and which is the outcome for a selected row in the data matrix. If confederate = 1, then the confederate is the ‘actor’ and the naive participant is the ‘partner’ in that row of the data matrix. If confederate = 0, then the naive participant is the ‘actor’ and the confederate is the ‘partner.’ We coded the sex of the ‘actor’ and the ‘partner’ as binary variables (0 = female, 1 = male). The RMS angular velocity of the ‘partner’ was used as a continuous predictor variable.
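The data layout described above can be illustrated with a short Python sketch that turns one 1 min observation of a dyad into two rows, one with each conversant as the ‘actor’. The dictionary keys, the function name and the example numbers are assumptions made for illustration.

```python
def actor_partner_rows(session, condition, naive, confederate):
    """Build two rows for one observation, one per choice of actor.

    `naive` and `confederate` are dicts with keys 'rms_velocity' and
    'male' (0 = female, 1 = male).
    """
    rows = []
    for is_confederate, actor, partner in ((0, naive, confederate),
                                           (1, confederate, naive)):
        rows.append({
            "session": session,
            "condition": condition,
            "confederate": is_confederate,          # dummy code for the actor
            "actor_velocity": actor["rms_velocity"],     # outcome
            "partner_velocity": partner["rms_velocity"],  # predictor
            "actor_male": actor["male"],
            "partner_male": partner["male"],
        })
    return rows


# Made-up values for a single 1 min block.
rows = actor_partner_rows(
    session=1, condition=3,
    naive={"rms_velocity": 4.2, "male": 0},
    confederate={"rms_velocity": 3.1, "male": 1},
)
```

Each velocity therefore appears twice in the data matrix, once as an outcome and once as a predictor, which is why the rows cannot be treated as independent.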
Binary variables were coded for each manipulated condition: attenuated head pitch and turn (0 = normal, 1 = 50% attenuation), and attenuated expression (0 = normal, 1 = 50% attenuation). Because only the naive participant sees the manipulated conditions, we also added interaction variables (confederate × delay condition and confederate × sex of partner), centering each binary variable prior to multiplying. The manipulated condition may affect the naive participant directly, but may also affect the confederate indirectly through changes in the behaviour of the naive participant. The interaction variables allow us to account for an overall effect of the manipulation as well as possible differences between the reactions of the naive participant and of the confederate.
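Centering each binary variable before multiplying can be sketched as below. Centering subtracts the column mean, so that the product term is less correlated with its components and the main effects can be read as average effects. The function names and the toy columns are assumptions.

```python
def centered(values):
    """Subtract the column mean from every entry."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def interaction(a, b):
    """Product of two mean-centered columns."""
    return [x * y for x, y in zip(centered(a), centered(b))]


# Toy columns: confederate dummy code and an attenuation condition.
confederate = [0, 0, 1, 1]
attenuated = [0, 1, 0, 1]
product_term = interaction(confederate, attenuated)
```

In this balanced toy example, the product term takes the values ±0.25 and sums to zero, so it is uncorrelated with either of the two columns it was built from.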
We then fit mixed effect models using restricted maximum likelihood. Because there is non-independence of rows in this data matrix, we need to account for this non-independence. An additional column is added to the data matrix that is coded by experimental session and then the mixed effects model of the data is grouped by the experimental session column (both conversations in which the naive participant engaged). Each session was allowed a random intercept to account for individual differences between experimental sessions in the overall RMS velocity. This mixed effects model can be written as

$$y_{ij} = b_{j0} + b_1 A_{ij} + b_2 P_{ij} + b_3 C_{ij} + b_4 H_{ij} + b_5 F_{ij} + b_6 V_{ij} + b_7 Z_{ij} + b_8 C_{ij} P_{ij} + b_9 C_{ij} H_{ij} + b_{10} C_{ij} F_{ij} + b_{11} C_{ij} V_{ij} + e_{ij} \tag{2.4}$$

$$b_{j0} = c_{00} + u_{j0} \tag{2.5}$$

where $y_{ij}$ is the outcome variable (lateral or A–P RMS velocity) for condition $i$ and session $j$. The other predictor variables are the sex of the actor $A_{ij}$, the sex of the partner $P_{ij}$, whether the actor is the confederate $C_{ij}$, the head pitch and turn attenuation condition $H_{ij}$, the facial expression attenuation condition $F_{ij}$, the vocal inflection attenuation condition $V_{ij}$ and the lateral or A–P RMS velocity of the partner $Z_{ij}$. As each session was allowed to have its own intercept, the predictions are relative to the overall angular velocity associated with each naive participant's session.
The results of a mixed effects random intercept model grouped by session predicting A–P RMS angular velocity of the head are displayed in table 1. As expected from previous reports, males exhibited lower A–P RMS angular velocity than females, and when the conversational partner was male there was lower A–P RMS angular velocity than when the conversational partner was female. Confederates exhibited lower A–P RMS velocity than naive participants, although this effect only just reached significance at the α = 0.05 level. Attenuated head pitch and turn and attenuated facial expression were both associated with greater A–P angular velocity: both conversants nodded with greater vigour when either the avatar's rigid head movement or the facial expression was attenuated. Thus, the naive participant reacted to the attenuated movement of the avatar by increasing her or his head movements. Also, the confederate (who was blind to the manipulation) reacted to the increased head movements of the naive participant by increasing his or her head movements. When the avatar attenuation was in effect, both conversational partners adapted by increasing the vigour of their head movements. There were no effects of either the attenuated vocal inflection or the A–P RMS velocity of the conversational partner. Only one interaction reached significance: confederates had a greater reduction in A–P RMS angular velocity when speaking to a male naive participant than the naive participants had when speaking to a male confederate.
The results for RMS lateral angular velocity of the head are displayed in table 2. As was true in the A–P direction, males exhibited less lateral RMS angular velocity than females, and conversants exhibited less lateral RMS angular velocity when speaking to a male partner. Confederates again exhibited less velocity than naive participants. Attenuated head pitch and turn was again associated with greater lateral angular velocity: participants turned away or shook their heads either more often or with greater angular velocity when the avatar's head pitch and turn variation was attenuated. However, in the lateral direction, we found no effect of the facial expression or the vocal inflection attenuation. There was an independent effect such that lateral head movements were negatively coupled. That is to say, in 1 min blocks when one conversant's lateral angular movement was more vigorous, their conversational partner's lateral movement was reduced. Again, only one interaction reached significance: confederates had a greater reduction in lateral RMS angular velocity when speaking to a male naive participant than the naive participants had when speaking to a male confederate. There are at least three differences between the confederates and the naive participants that might account for this effect: (i) the confederates have more experience in the video booth than the naive participants and may thus be more sensitive to the context provided by the partner as the overall context of the video booth is familiar; (ii) the naive participants are seeing an avatar and it may be that there is an additional partner sex effect of seeing a full body video over seeing a ‘floating head’; and (iii) the reconstructed avatars have fewer eye blinks than the video since some eye blinks are not caught by the motion tracking.
Automated facial tracking was successfully applied to create real-time re-synthesized avatars that were accepted as being video by naive participants. No participant guessed that we were manipulating the apparent video in their videoconference conversations. This technological advance presents the opportunity for studying adaptive facial behaviour in natural conversation while still being able to introduce experimental manipulations of rigid and non-rigid head movements without either participant knowing the extent or timing of these manipulations.
The damping of head movements was associated with increased A–P and lateral angular velocity. The damping of facial expressions was associated with increased A–P angular velocity. There are several possible explanations for these effects. During the head movement attenuation condition, naive participants might perceive the confederate as looking more directly at him or her, prompting more incidents of gaze avoidance. A conversant might not have received the expected feedback from an A–P or lateral angular movement of a small velocity and adapted by increasing her or his head angle relative to the conversational partner in order to elicit the expected response. Naive participants may have perceived the attenuated facial expressions of the confederate as being non-responsive and attempted to increase the velocity of their head nods in order to elicit greater response from their conversational partners.
As none of the interaction effects for the attenuated conditions were significant, the confederates exhibited the same degree of response to the manipulations as the naive participants. Thus, when the avatar's head pitch and turn variation was attenuated, both the naive participant and the confederate responded with increased velocity head movements. This suggests that there is an expected degree of matching between the head velocities of the two conversational partners. Our findings provide evidence in support of a hypothesis that the dynamics of head movement in dyadic conversation include a shared equilibrium: both conversational partners were blind to the manipulation and when we perturbed one conversant's perceptions, both conversational partners responded in a way that compensated for the perturbation. It is as if there were an equilibrium energy in the conversation and when we removed energy by attenuation, and thus changed the value of the equilibrium, the conversational partners supplied more energy in response and thus returned the equilibrium towards its former value.
These results can also be interpreted in terms of symmetry formation and symmetry breaking. The dyadic nature of the conversants' responses to the asymmetric attenuation conditions is evidence of symmetry formation. But head turns have an independent effect of negative coupling, where greater lateral angular velocity in one conversant was related to reduced angular velocity in the other: evidence of symmetry breaking. Our results are consistent with symmetry formation being exhibited in both head nods and head turns, while symmetry breaking is more related to head turns. In other words, head nods may help form symmetry between conversants while head turns contribute to both symmetry formation and symmetry breaking. One argument for why these relationships would be observed is that head nods may be more related to acknowledgement or attempts to elicit expressivity from the partner, whereas head turns may be more related to new semantic information in the conversational stream (e.g. floor changes) or to signals of disagreement or withdrawal.
With the exception of some specific expressions (e.g. Keltner 1995; Ambadar et al. 2009), previous research has ignored the relationship between head movements and facial expressions. Our findings suggest that facial expression and head movement may be closely related. These results also indicate that the coupling between one conversant's facial expressions and the other conversant's head movements should be taken into account. Future research should inquire into these within-person and between-person cross-modal relationships.
The attenuation of facial expression created an effect that appeared to the research team as being that of someone who was mildly depressed. Decreased movement is a common feature of psychomotor retardation in depression, and depression is associated with decreased reactivity to a wide range of positive and negative stimuli (Rottenberg 2005). Individuals with depression or dysphoria, in comparison with non-depressed individuals, are less likely to smile in response to pictures or movies of smiling faces and affectively positive social imagery (Gehricke & Shapiro 2000; Sloan et al. 2002). When they do smile, they are more likely to damp their facial expression (Reed et al. 2007).
Attenuation of facial expression can also be related to cognitive states or social context. For instance, if one's attention is internally focused, the attenuation of facial expression may result. Interlocutors might interpret damped facial expression of their conversational partner as reflecting a lack of attention to the conversation.
Naive participants responded to damped facial expression and head turns by increasing their own head nods and head turns, respectively. These effects may have been efforts to elicit more responsive behaviour in the partner. In response to simulated maternal depression by their mother, infants attempt to elicit a change in their mother's behaviour by smiling, turning away and then turning again towards her and smiling. When they fail to elicit a change in their mothers' behaviour, they become withdrawn and distressed (Cohn & Tronick 1983). Similarly, adults find exposure to prolonged depressed behaviour increasingly aversive and withdraw (Coyne 1976). Had we attenuated facial expression and head motion for more than a minute at a time, naive participants might have become less active following their failed efforts to elicit a change in the confederate's behaviour. This hypothesis remains to be tested.
There are a number of limitations of this methodology that could be improved with further development. For instance, while we can manipulate the degree of expressiveness as well as the identity of the avatar (Boker & Cohn in press), we cannot yet manipulate specific facial expressions in real time. Depression not only attenuates expression, but also makes some facial actions, such as contempt, more likely (Ekman et al. 2005; Cohn et al. 2009). As an analogue for depression, it would be important to manipulate specific expressions in real time. In other contexts, cheek raising (AU 6 in the facial action coding system) (Ekman et al. 2002) is believed to covary with communicative intent and felt emotion (Coyne 1976). In the past, it has not been possible to experimentally manipulate discrete facial actions in real time without the source person's awareness. If this capability could be implemented in the videoconference paradigm, it would make possible a wide range of experimental tests of emotion signalling.
Other limitations include the need for person-specific models, restrictions on head rotation and limited face views. The current approach requires manual training of face models, which involves hand labelling about 30–50 video frames. Because this process requires several hours of pre-processing, avatars could be constructed for confederates but not for unknown persons, such as naive participants. It would be useful to have the capability of generating real-time avatars for both conversation partners. Recent efforts have made progress towards this goal (Lucey et al. in press; Saragih et al. 2009). Another limitation is that if the speaker turns more than about 20° from the camera, parts of the face become obscured and the model can no longer track the remainder of the face. Algorithms have been proposed that address this issue (Gross et al. 2004), but it remains a research question. Another limitation is that the current system has modelled the face only from the eyebrows to the chin. A better system would include the forehead, and some model of the head, neck, shoulders and background in order to give a better sense of the placement of the speaker in context. Adding forehead features is relatively straight-forward and has been implemented. Tracking of neck and shoulders is well advanced (Sheikh et al. 2008). The videoconference avatar paradigm has motivated new work in computer vision and graphics and made possible new methodology to experimentally investigate social interaction in a way not possible before. The timing and identity of social behaviour in real time can now be rigorously manipulated outside of participants' awareness.
We presented an experiment that used automated facial and head tracking to perturb the bidirectionally coupled dynamical system formed by two individuals speaking with one another over a videoconference link. The automated tracking system allowed us to create re-synthesized avatars that were convincing to naive participants and, in real time, to attenuate head movements and facial expressions formed during natural dyadic conversation. The effect of these manipulations exposed some of the complexity of multi-modal coupling of movements during face to face interactions. The experimental paradigm presented here has the potential to transform social psychological research in dyadic and small group interactions owing to an unprecedented ability to control the real-time appearance of facial structure and expression.
Preparation of this manuscript was supported in part by NSF grant BCS05 27397, EPSRC grant EP/D049075 and NIMH grant MH 51435. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We gratefully acknowledge the help of Kathy Ashenfelter, Tamara Buretz, Eric Covey, Pascal Deboeck, Katie Jackson, Jen Koltiska, Sean McGowan, Sagar Navare, Stacey Tiberio, Michael Villano and Chris Wagner.
One contribution of 17 to a Discussion Meeting Issue ‘Computation of emotions in man and machines’.
- © 2009 The Royal Society