Robotics has traditionally focused on developing intelligent machines that can manipulate and interact with objects. The promise of personal robots, however, challenges researchers to develop socially intelligent robots that can collaborate with people to do things. In the future, robots are envisioned to assist people with a wide range of activities such as domestic chores, helping elders to live independently longer, serving a therapeutic role to help children with autism, assisting people undergoing physical rehabilitation and much more. Many of these activities will require robots to learn new tasks, skills and individual preferences while ‘on the job’ from people with little expertise in the underlying technology. This paper identifies four key challenges in developing social robots that can learn from natural interpersonal interaction. The author highlights the important role that expressive behaviour plays in this process, drawing on examples from the past 8 years of her research group, the Personal Robots Group at the MIT Media Lab.
Studies by the United Nations Economic Commission and International Federation of Robotics forecast a dramatic increase in consumer demand for robots that assist, protect, educate and entertain over the next 20–30 years. In the future, personal robots will be able to help people as capable assistants in their daily activities. Consider cooperative activities such as preparing a meal together, building a structure with teammates or teaching someone a new skill. Through sophisticated forms of social interaction and learning, people are able to accomplish more than they could alone. Socially intelligent robots could have a significant positive impact on real-world challenges, such as helping elders to live independently at home longer, serving as learning companions for children and enriching learning experiences through play, serving a therapeutic role to help children with autism learn communication skills, or functioning as effective members of human–robot teams for disaster response missions, construction tasks and more.
Many of these applications require robots to engage humans in sophisticated forms of social interaction, including human-centred multi-modal communication, teamwork and social forms of learning such as tutelage. Over the past several years, my research has focused on endowing autonomous robots with social intelligence to enable them to engage in the powerful, social forms of interaction and learning that people readily participate in. This vision is motivated by the observation that humans are ready-made experts in social interaction; the challenge is to design robots to participate in what comes naturally to people. By doing so, socially interactive robots could help not only specialists, but anyone.
Today, however, autonomous and semi-autonomous robots are widely regarded as tools that trained operators command and monitor to perform tasks. Beyond robustness and proficiency in the physical world, however, the promise of personal robots that can partake in the daily lives of people is pushing robotics and AI research in new directions. Whereas robotics has traditionally focused on developing machines that can manipulate and interact with things, the promise of personal robots challenges us to develop robots that are adept in their interactions with people. Further, in contrast to the traditional view of robots as sophisticated tools that we use to do things for us, this new generation of socially intelligent robots is envisioned as partners that collaborate to do things with us.
Over the past several years, new research fields have emerged (i.e. human–robot interaction and social robotics) to address challenges in building robots that are skilful in their interactions with people (Dautenhahn 1995; Fong et al. 2003; Breazeal 2004b; Duffy 2008). Given that social robots are designed to interact with people in human-centric terms within human environments, many are humanoid (e.g. Tanaka et al. 2004; Ogura et al. 2006) or animal-like (e.g. Fujita 2004; Wada et al. 2005) in form, and even the more mechanical-looking robots tend to have anthropomorphic movement or physical features (e.g. Kozima 2006; Tanaka et al. 2006).
A unifying characteristic is that social robots communicate and coordinate their behaviour with humans through verbal, non-verbal or affective modalities. For instance, these might include whole-body motion (e.g. dancing, Duffy 2003; walking hand-in-hand, Lim et al. 2004), proxemics (i.e. how a robot should approach a person, Walters et al. 2008; follow a person, Gockley et al. 2007; or maintain appropriate interpersonal distance, Brooks & Arkin 2007), gestures (e.g. pointing, shrugging shoulders or shaking hands, Miwa et al. 2004a,b; Roccella et al. 2004), facial expressions (e.g. Iida et al. 1998; DiSalvo et al. 2002; Berns & Hirth 2006; Hayashi et al. 2006), gaze behaviour (e.g. Kikuchi et al. 1998; Sakita et al. 2004; Sidner et al. 2005), head orientation and shared attention (e.g. Imai et al. 2001; Fujie et al. 2004), linguistic and paralinguistic cues (e.g. Matsusaka et al. 2003; Fujie et al. 2005) or emotive vocalization (e.g. Cahn 1990; Abadjieva et al. 1993), social touch-based communication (e.g. Stiehl et al. 2005) and how these cues complement verbal communication (e.g. Cassell et al. 2000).
Progress continues in building robots that can learn from people, through observation, imitation or direct tutelage (for reviews see Schaal 1999; Argall et al. 2009). For instance, impressive strides have been made in designing robots that learn new skills (e.g. pendulum swing-up, Atkeson & Schaal 1997b; body schema, Hersch et al. 2008; peg insertion, Hovland et al. 1996; dance gestures, Mataric et al. 1998; communication skills and protocols, Billard et al. 1998; Roy & Pentland 1998; Scassellati 1998) as well as tasks (e.g. stacking objects, Kuniyoshi et al. 1994; Calinon & Billard 2007; fetch and carry, Nicolescu & Mataric 2003; setting a table, Pardowitz et al. 2007 or sorting objects into bins, Saunders et al. 2006; Chernova & Veloso 2008).
Modern robots are beginning to participate as members of heterogeneous teams that cooperate with people in order to achieve shared goals. For instance, a remote human might supervise a distributed team of robots to perform a task (e.g. disaster response or search and rescue, Bluethmann et al. 2004; Murphy et al. 2008). In addition, co-located teamwork has been explored such as a human and a robot working side by side (Adams et al. 2009), or a team of humans and robots working in the same area to assemble a structure (Fong et al. 2005).
Furthermore, as people begin to interact with robots more closely, it is important that robots' behaviour, rationale and motives be easily understood. The more these mirror natural human analogues, the more intuitive it becomes for us to communicate and coordinate our behaviour with robots. Researchers have begun to explore the role of affect (e.g. Picard 2000; Fellous & Arbib 2005; Duffy 2008; Cañamero), perspective taking and theory of other minds (e.g. Scassellati 2001; Johnson & Demiris 2005; Trafton et al. 2005), and even simple forms of empathy (Dautenhahn 1997; Breazeal et al. 2005a) and models of attachment (Cañamero et al. 2006) in generating a robot's behaviour.
A relevant issue underlying these different kinds of interactions is how people form social judgements of robots—are robots perceived as trustworthy, persuasive, reliable, likeable, etc. (e.g. Kidd & Breazeal 2008; Siegel 2008)? A number of groups have also explored how people's social judgements of robots compare to animated agents and even mixed-reality agents (Holz et al. 2009). It is intriguing that the physical presence of robots seems to matter to people as robots often score higher than their virtual counterparts on measures of engagement, social presence, working alliance as well as social influence on human behaviour (e.g. Kidd & Breazeal 2004; Powers et al. 2007; Bainbridge et al. 2008). Researchers have started delving into functional magnetic resonance imaging studies to try to understand these differences and to what extent people attribute human characteristics to robots, including theory of mind (Krach et al. 2008).
2. Robots that learn from people
Within this broader context of human–robot interaction (HRI) and social robotics, this paper summarizes the past 8 years of research from my group (the Personal Robots Group at the MIT Media Lab; http://www.media.mit.edu/~cynthiab; http://robotic.media.mit.edu; Breazeal 2002) with respect to significant lessons we have learned in our quest to build robots that can learn from anyone. My group is recognized for pioneering HRI and social robotics through the development of expressive autonomous robots that socially interact with people in a natural manner (Breazeal 2002). Figure 1 presents the three ‘flagship’ social robots we have developed, starting with Kismet in the late 1990s, Leonardo spanning the early–mid 2000s and our new robot Nexi. Each design is considered state-of-the-art (building upon lessons and technologies of its predecessor) and supports a different set of highly related scientific questions at the intersection of emotion and HRI, social learning, sophisticated forms of social cognition and human–robot teamwork.
One of my main research interests has been to develop robots that can learn from natural interpersonal interactions. Personal robots of the future will need to quickly learn new tasks and skills from people who are not specialists in robotics or machine learning techniques but possess a lifetime of experience in teaching and learning from one another. A major technical goal is to engineer robots that can leverage social guidance to efficiently and robustly acquire new capabilities from natural human instruction, and to do so dramatically faster than they could alone. As an integral part of this endeavour, my group has contributed new knowledge and findings towards how humans teach social robots, and the important role that the robot's expressive behaviour plays in this interpersonal process.
In contrast to traditional statistical machine learning approaches that require human expertise to craft a successful large-scale search problem that uses little or no real-time human input, my group's approach recognizes the advantage of designing robots that can leverage the same rich forms of social interaction that people readily use to teach or learn from one another. Human teachers verbally and non-verbally guide the exploration of learners by directing attention, providing feedback, structuring experiences, supporting learning attempts, and regulating the complexity and difficulty of information to push learners a little beyond their current abilities in order to help them acquire new skills and concepts. In turn, learners tune their teachers' instruction and shape subsequent guidance by expressing their current understanding through demonstration and a rich variety of communicative acts. Through this interaction, learner and teacher form mental models of each other that they use to support the learning–teaching process as a richly collaborative activity.
It is actually very difficult to build robotic systems that can successfully learn in real time from the general public. Human teaching behaviour is highly variable and complex, and different people bring different styles of interaction to the table. Today, it is common practice for robots to be taught and evaluated by the same researchers who developed them. Not surprisingly, if the teacher has special technical expertise and knowledge of the underlying learning algorithms that the robot uses, this leads to a strongly machine-centric style of interaction that is neither natural nor intuitive to someone who lacks such expertise. In fact, although substantial work exists in developing robots that learn from people, it is still uncommon to conduct human participant studies with members of the general public to assess the learning performance of a robot when taught by someone who is not an expert in robotics, machine learning or otherwise.
My research group is unusual for a robotics group, having conducted over a dozen controlled, in-lab human participant studies with hundreds of participants in order to gain greater qualitative and quantitative understanding of how people approach the task of teaching a socially responsive machine. Often, we begin an investigation with a human study to learn more details about how people teach each other. Then, computationally modelling this process allows us to identify and explore the use of a variety of social cues, expressive behaviours, skills and cognitive capabilities that support social learning in robots. In this way, we use social robots as a scientific tool for measuring and quantifying human behaviour in new ways. This in turn has allowed us to generate new findings and discover new knowledge that can even inform how people teach and learn from one another. Figure 2 contrasts (a) the traditional machine-centric approach with (b) our human-centric approach.
3. Challenges in building teachable robots
Applying these results, my group has developed and evaluated how these social behaviours and expressive capabilities enable robots to learn interactively with human participants, as well as how the same social skills address several key challenges in learning from natural human instruction. Below, I describe several of these challenges, together with highlights of my group's contributions towards their solution.
(a) Challenge 1
Robots face a fundamental mismatch between their social and communicative sophistication and that of humans. For effective learning, however, it is important that learners are slightly challenged to push themselves towards new abilities that are within reach, while avoiding situations where they are too overwhelmed to make sense of things. Fortunately, teachers and learners can work together to establish a suitable level of difficulty and to regulate the complexity of the interaction so that it suits both.
(i) Example: envelope displays
To address this challenge, our research has contributed evidence for the importance of paralinguistic communication cues in HRI, and how they can be used to successfully manage this imbalance in a natural and intuitive manner. Through HRI studies with our robot, Kismet, we found that humans readily entrain to a robot's non-verbal social cues (e.g. envelope displays that regulate the exchange of speaking turns in human conversation) to improve the efficiency and robustness of ‘conversational’ flow by intuitively slowing the rate of turn exchanges to a level that the robot can handle well. For instance, humans tend to make eye contact and raise their eyebrows when ready to relinquish their speaking turn, and tend to break gaze and blink when starting their speaking turn. When these same facial displays are implemented on a robot, we found that they are effective in smoothing and synchronizing the exchange of speaking turns with human subjects, resulting in fewer interruptions and awkward long pauses between turns (Breazeal 2003b).
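The envelope-display logic described above can be sketched as a simple mapping from turn-taking state to non-verbal cues. This is an illustrative reconstruction with hypothetical state and cue names, not Kismet's published implementation:

```python
from enum import Enum, auto

class Turn(Enum):
    ROBOT_SPEAKING = auto()
    RELINQUISH = auto()        # robot is ready to hand over the floor
    HUMAN_SPEAKING = auto()

class EnvelopeDisplay:
    """Maps the robot's turn-taking state to non-verbal envelope cues."""

    def cues_for(self, state):
        if state == Turn.RELINQUISH:
            # Ready to relinquish the turn: make eye contact, raise brows.
            return {"gaze": "eye_contact", "brows": "raised", "blink": False}
        if state == Turn.ROBOT_SPEAKING:
            # Starting a speaking turn: break gaze and blink.
            return {"gaze": "averted", "brows": "neutral", "blink": True}
        # While the human speaks, attend to the speaker.
        return {"gaze": "eye_contact", "brows": "neutral", "blink": False}
```

Because human partners read these same cues off one another, they entrain to them readily, slowing the exchange of turns to a pace the robot can handle.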
(ii) Example: coordination behaviours
Through another series of HRI studies, we examined the use of a number of coordination behaviours where participants guided our robot, Leonardo, using speech and gesture to perform a physical task involving pressing a sequence of coloured buttons ON. Leonardo communicates through gaze (visual attention) and facial expressions (affective state) or explicitly through gestural cues (i.e. pointing). The robot's coordination behaviours include visually attending to the human's actions (e.g. pointing to or pressing a button) to acknowledge their contributions, issuing a short nod to acknowledge the success and completion of the task or subtask (i.e. turning the buttons ON), visually attending to the person's attention directing cues such as to where the human looks or points, looking back to the human once the robot presses a button to make sure its contribution is acknowledged, and pointing to buttons in the workspace to direct the human's attention towards them. Both self-report via questionnaire and behavioural analysis of video support the hypothesis that these non-verbal communication cues positively impact human–robot task performance with respect to understandability of the robot, efficiency of task performance and robustness to errors that arise from miscommunication (Breazeal et al. 2005b).
(iii) Example: emotive displays
In addition, we found that emotive expressions (as governed by the robot's emotion-based models) are interpreted by humans as natural analogues, and thereby can be used by the robot to regulate its interaction with the human—to keep the complexity of the interaction within the robot's perceptual limits and even to help the robot to achieve its goals (Breazeal & Scassellati 2000). Many of these results were first observed with our robot, Kismet, the first robot explicitly designed to explore socio-emotive face-to-face interactions with people (Breazeal 2002). Our research with Kismet was strongly inspired by the origins of social interaction and communication in people, namely that which occurs between carer and infant, through extensive computational modelling guided by insights from developmental psychology and behavioural models from ethology (Breazeal 2003a). It is well established that early infant–carer exchanges are grounded in the regulation of emotion and its expression.
Inspired by these interactions, Kismet's cognitive–affective architecture was designed to implement core proto-social responses exhibited by infants, given their critical role in normal social development. Internally, Kismet's models of emotion interacted intimately with its cognitive systems to influence behaviour and goal arbitration. Through a process of behavioural homeostasis, these emotive responses served to restore the robot's internal affective state to a mildly aroused, slightly positive state—corresponding to a state of interest and engagement in people and its surroundings that fosters learning. One purpose of Kismet's emotive responses was to reflect the degree to which its drives and goals were being successfully met. A second purpose was to use emotive communication signals to regulate and negotiate its interactions with people. Specifically, Kismet utilized emotive displays to regulate the intensity of playful interactions with people, ensuring that the complexity of the perceptual stimulus was within a range that the robot could handle and potentially learn from. In effect, Kismet socially negotiated its interaction with people via its emotive responses to have humans help it achieve its goals, satiate its drives and maintain a suitable learning environment (Breazeal 2004a).
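The homeostatic dynamic can be sketched as a simple update rule over a two-dimensional affect state (arousal, valence) that drifts towards a mildly aroused, slightly positive set point, and triggers a regulatory display when stimulation grows too intense. The set point, gains and threshold below are illustrative assumptions, not Kismet's actual parameters:

```python
def homeostatic_update(arousal, valence, stimulus_intensity,
                       set_point=(0.3, 0.2), gain=0.1):
    """Drift affect back towards the set point, perturbed by stimulation."""
    target_a, target_v = set_point
    arousal += gain * (target_a - arousal) + 0.05 * stimulus_intensity
    # Overly intense stimulation also pushes valence negative.
    valence += gain * (target_v - valence) - 0.05 * max(0.0, stimulus_intensity - 1.0)
    return arousal, valence

def emotive_display(arousal, valence, overwhelm_threshold=0.8):
    """Select a display that negotiates the interaction's intensity."""
    if arousal > overwhelm_threshold:
        return "withdraw"   # e.g. avert gaze: asks the human to slow down
    if valence < 0.0:
        return "distress"   # signals an unmet drive or goal
    return "interest"       # engaged and ready to learn
```

Left unstimulated, the state settles at the set point and the robot displays interest; sustained over-stimulation drives arousal past the threshold and produces a withdrawal display that prompts the human to tone the interaction down.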
(iv) Summary: joint action
While more established approaches to instructing robots view the interaction as a one-way flow of information from human to machine, this body of work challenges the paradigm by illustrating the myriad of ways in which humans participate in the teaching–learning process as tightly coupled joint action. Humans do not simply provide training inputs as a one-sided interaction to which the learner must react. Rather, people are constantly reading and interpreting numerous behavioural cues of the robot as indicators of its internal state, and are continually adapting and tuning their teaching behaviour to be suitable for the robot learner.
This interaction dynamic has significant implications for the design of robots that learn from people. The robot is not restricted to learning in a complex environment that does not care whether the robot succeeds or fails—a common assumption in robot learning systems. Rather, people view teaching and learning as a partnership with shared goals. Because of this, the robot can proactively improve the quality of its learning environment, tuning the human's teaching acts to be more suitable, by using communication acts that reveal its learning process to the human teacher.
(b) Challenge 2
Faced with an incoming stream of sensory data, a robot must figure out which of its myriad of perceptions are relevant to the task at hand. This is an important capability for generating coherent behaviour as well as for learning given that the search over state space becomes enormous as perceptual abilities and complexity of the environment increase.
(i) Example: saliency and shared attention
To address this challenge we have identified a set of socially embodied cues and socio-cognitive abilities that assist the robot's determination of saliency when learning a task. These cues and abilities make the robot's underlying attention mechanisms responsive to a human teacher's efforts to highlight a distinct environmental context or change that is relevant to the learning task.
In a series of human studies we have identified a growing set of social cues and socio-cognitive skills that play an effective role in addressing the saliency question.
For instance, we have implemented a multi-modal attention system to enable the robot to leverage the human teacher's desire to direct its visual attention by following the human's pointing gestures or gaze (estimated by head pose). To determine the robot's attentional focus, the attention system computes the level of saliency (a measure of ‘interest’) per feature channel for objects and events in the robot's perceivable space (Breazeal & Scassellati 1999; Breazeal et al. 2000). For Leonardo, the contributing factors to an object's overall saliency fall into three categories: its perceptual properties (i.e. its proximity to the robot, its colour, whether it is moving, etc.), the internal state of the robot (i.e. whether this is a familiar object, what the robot is currently searching for and other goals) and social reference (whether it is pointed to, looked at, talked about or is the referential focus). For each item in the perceivable space, the overall saliency at each time step is the weighted sum of these factors. The item with the highest saliency becomes the current attentional focus of the robot, and also determines where the robot's gaze is directed. The robot's gaze direction is an important communication device, verifying for the human partner what the robot is attending to and thinking about.
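A weighted-sum saliency computation of this kind might be sketched as follows; the feature channels and weights are illustrative assumptions rather than the values used on Leonardo:

```python
def overall_saliency(item, weights):
    """Weighted sum across perceptual, internal-state and social channels."""
    return sum(weights[channel] * item.get(channel, 0.0) for channel in weights)

def attentional_focus(items, weights):
    """The most salient item wins the robot's attention (and its gaze)."""
    return max(items, key=lambda name: overall_saliency(items[name], weights))

weights = {"proximity": 0.3, "motion": 0.3,
           "familiarity": 0.1, "social_reference": 0.3}
items = {
    "red_ball":   {"proximity": 0.8, "motion": 0.2, "familiarity": 1.0},
    "blue_block": {"proximity": 0.4, "motion": 0.1, "familiarity": 0.5,
                   "social_reference": 1.0},   # the teacher points at it
}
# The pointed-at block out-competes the nearer, more familiar ball.
print(attentional_focus(items, weights))   # → blue_block
```

The social-reference channel is what lets the teacher's deictic cues re-weight the competition, so that a pointed-at object can capture attention even when other objects are perceptually more striking.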
The human's attentional focus is determined by what he or she is currently looking at. Leonardo calculates this using the head pose tracking data, assuming that the person's head orientation is a good estimate of his or her gaze direction. By following the person's gaze, the shared attention system determines which (if any) object is the attentional focus of the human's gaze. The mechanism by which infants track the referential focus of communication is still an open question, but a number of sources, such as word-learning studies, indicate that looking time is a key factor. For example, when a child is playing with one object and hears an adult say ‘It's a modi’, the child does not attach the label to the object the child happens to be looking at (which is often the adult's face!). Instead the child redirects his or her attention to look at what the adult is looking at, and attaches the label to that object. For our robot, we use a simple voting mechanism to track a relative looking time for each of the objects in the robot's and human's shared environment. The object with the highest accumulated relative looking time is identified as the referent of the communication between the human and the robot (Thomaz et al. 2005).
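The voting mechanism can be illustrated with a minimal tracker that accumulates decayed looking-time votes per object; the decay rate and time step are assumptions for illustration:

```python
class ReferentTracker:
    """Tracks relative looking time to identify the referent of communication."""

    def __init__(self, decay=0.95):
        self.votes = {}        # object name -> accumulated looking-time votes
        self.decay = decay

    def observe(self, looked_at, dt=1.0):
        """Each time step, decay old votes and credit the gazed-at object."""
        for obj in self.votes:
            self.votes[obj] *= self.decay
        if looked_at is not None:
            self.votes[looked_at] = self.votes.get(looked_at, 0.0) + dt

    def referent(self):
        """The object with the most accumulated looking time wins."""
        return max(self.votes, key=self.votes.get) if self.votes else None

tracker = ReferentTracker()
for gazed in ["cup", "cup", "cup", "block"]:
    tracker.observe(gazed)
print(tracker.referent())   # → cup
```

The decay term makes the estimate relative rather than absolute, so the referent can shift as the human's gaze moves on to a new object.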
Using these models, we have found that active monitoring of shared visual attention between the human teacher and the robot learner is important in order to achieve robustness in the learning interaction. In a series of human participant studies where human teachers guide a robot to perform a simple task (learning to operate a control panel with a lever, toggle and button), we have found that humans readily coordinate their teaching behaviour with the robot's gaze behaviour—waiting until the robot re-establishes eye contact before offering their next guidance cue, adaptively re-orienting their guidance cue to be in alignment with the robot's current visual focus, actively trying to re-direct the robot's gaze through deictic cues or offering more guidance if the robot's gaze behaviour conveys uncertainty in what to do next (e.g. looking back and forth among several possible alternatives) (Breazeal & Thomaz 2008a; Thomaz & Breazeal 2008). These findings suggest that people read the robot's gaze as an indicator of its internal state of attention as well as solicitations for help, and intuitively coordinate their teaching acts to support the robot's learning process.
(ii) Example: perspective taking
In another series of human and HRI studies, we identified, verified and evaluated mental perspective taking as an important socio-cognitive skill that helps either human or robot learners to focus attention on the subset of the problem space that is important to the teacher by actively considering the teacher's experience such as visual perspective, attentional focus or resource considerations (Berlin et al. 2006). This constrained attention enables the robot learner to overcome the ambiguity and incompleteness that is often present in human demonstrations.
To endow Leonardo with perspective taking abilities, our cognitive–affective architecture incorporates simulation-theoretic mechanisms as a foundational and organizational principle. Simulation theory holds that certain parts of the brain have dual use; they are used not only to generate our own behaviour and mental states, but also to predict and infer the same in others. To try to recognize or infer another person's mental process, the robot uses its own cognitive processes and body structure to simulate the mental states of the other person—in effect, taking the mental perspective of another.
In figure 3, the two concentric bands denote two different modes of operation. In the generation mode (the light band) the robot constructs its own mental states to behave intelligently in the world. In the simulation mode (the dark band) the robot constructs and represents the mental states of its human collaborator based on observing his or her behaviour and taking their mental perspective. By doing so, the mental states of the human and the robot are represented in the same terms so that they can be readily compared and related to one another. For instance, within the perception system, the robot performs a transformation to estimate what the human partner can see from his or her vantage point. Within the motor system, mirror-neuron inspired mechanisms are used to map and represent perceived body positions of the human into the robot's own joint space to conduct action recognition. Within the belief system, belief construction is used in conjunction with adopting the visual perspective of the human partner in order to estimate the beliefs the human is likely to hold given what he or she can visually observe. Finally, within the intention system where goal-directed behaviours are generated, schemas relate preconditions and actions with desired outcomes and are organized to represent hierarchical tasks. Within this system, motor information is used along with perceptual and other contextual clues (i.e. task knowledge) to infer the human's goals and how he or she might be trying to achieve them (i.e. plan recognition).
In a learning situation, the robot can take the perspective of the teacher in order to model the task from their perspective. In effect, the robot runs a parallel copy of its task-learning engine that operates on its simulated representation of the human's beliefs. In essence, this focuses the hypothesis generation mechanism on the subset of the input space that matters to the human teacher. This enables the robot to learn what the teacher intends to teach even if the demonstrations are ambiguous.
To investigate this, we conducted a human participant study where the participants were asked to engage in four different learning tasks involving foam building blocks. We gathered data from 41 participants, divided into two groups: 20 participants observed demonstrations provided by a human teacher sitting opposite them (the social condition), while 21 participants were shown static images of the same demonstrations, with the teacher absent from the scene (the non-social condition). Participants were asked to show their understanding of the presented skill either by re-performing the skill on a novel set of blocks (in the social context) or by selecting the best matching image from a set of possible images (in the non-social context). Figure 4 (left) illustrates sample demonstrations of each of the four tasks. The tasks were designed to be highly ambiguous, providing the opportunity to investigate how different types of perspective taking might be used to resolve these ambiguities. The subjects' demonstrated rules can be divided into three categories: perspective taking (PT) rules, non-perspective taking (NPT) rules and rules that did not clearly support either hypothesis (other). For instance, task 1 focused on visual perspective taking during the demonstration. Participants were shown two demonstrations with blocks in different configurations. In both demonstrations, the teacher attempted to fill all of the holes in the square blocks with the available pegs. Critically, in both demonstrations, a blue block lay within clear view of the participant but was occluded from the view of the teacher by a barrier. The hole of this blue block was never filled by the teacher. Thus, an appropriate (NPT) rule might be ‘fill all but blue’ or ‘fill all but this one,’ but if the teacher's perspective is taken into account, a more parsimonious (PT) rule might be ‘fill all of the holes’ (see figure 4).
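The role of visual perspective taking in this task can be sketched with a toy rule learner: restricting hypothesis generation to the teacher-visible subset of the input space yields the more parsimonious rule. The block names, visibility test and rule strings are hypothetical, not the representation used by the actual task-learning engine:

```python
def candidate_rule(demonstration):
    """Induce a simple rule from which blocks the teacher filled."""
    unfilled = sorted(b for b, filled in demonstration.items() if not filled)
    if not unfilled:
        return "fill all of the holes"
    return "fill all but " + ", ".join(unfilled)

# The teacher fills every hole he or she can see; the blue block is
# occluded from the teacher (but visible to the learner) and stays unfilled.
demo = {"red": True, "green": True, "blue": False}
occluded_from_teacher = {"blue"}

# Without perspective taking, the unfilled blue block looks meaningful.
print(candidate_rule(demo))   # → fill all but blue

# With perspective taking, hypotheses are generated only over the
# subset of the scene the teacher could see.
teacher_view = {b: f for b, f in demo.items() if b not in occluded_from_teacher}
print(candidate_rule(teacher_view))   # → fill all of the holes
```

The same filtering idea underlies the parallel learning engine described above: it operates on the simulated representation of the teacher's beliefs rather than on the robot's own, richer view of the scene.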
The tasks from our human study were used to create a benchmark suite for our architecture. In our simulation environment, the robot was presented with the same task demonstrations as were provided to the study participants. The learning performance of the robot was analysed in two conditions: with the perspective taking mechanisms intact and with them disabled. Table 1 (left) shows the hypotheses entertained by the robot in the various task conditions at the conclusion of the demonstrations. The hypotheses favoured by the learning mechanism are highlighted in italic. For comparison, table 1 (right) displays the rules selected by study participants, with the most popular rules for each task highlighted in italic. For every task and condition, the rule learned by the robot matches the most popular rule selected by humans.
These results support our hypothesis that the robot's perspective taking mechanisms focus its attention on a region of the input space similar to that attended to by study participants in the presence of a human teacher. It should also be noted, as evident in table 1, that participants generally seemed to entertain a more varied set of hypotheses than the robot. In particular, participants often demonstrated rules based on spatial or numeric relationships between the objects—relationships that are not yet represented by the robot. Thus, the differences in behaviour between humans and the robot can largely be understood as a difference in the scope of the relationships considered between the objects in the example space, rather than as a difference in the underlying space itself. The robot's perspective taking mechanisms seem to be successful at bringing the robot's focus of attention into alignment with the humans' focus of attention in the presence of a social teacher.
(iii) Example: spatial scaffolding
In other human participant and HRI experiments, we have identified, verified and evaluated a set of simple, prevalent and highly reliable spatial scaffolding cues by which human teachers interactively structure and organize the physical workspace to help direct the attention of the learner (e.g. moving objects nearer or farther from the learner's body to signify their relevance) (Breazeal & Berlin 2008).
For example, we designed a set of tasks to examine how teachers emphasize and de-emphasize objects in a learning environment with their bodies, and how this emphasis and de-emphasis guides the exploration of a learner and ultimately the learning that occurs. In our human study, we gathered data from 72 individual participants, combined into 36 pairs. For each pair, one participant was randomly assigned to play the role of teacher and the other participant was assigned the role of learner for the duration of the study. For all the tasks, participants were asked not to talk, but were told that they could communicate in any way other than speech. The teacher and learner stood on opposite sides of a tall table, with 24 colourful foam building blocks (four different colours and six different shapes) arranged between them on the tabletop. The study tasks were interactive ‘secret constraint’ tasks where one person (the learner) knows the task goal (construct a tangram-like figure out of the blocks) but does not know that there is a secret constraint to accomplish the task successfully. The other person (the teacher) does not know the task goal (the figure) but knows the constraint (e.g. ‘the figure must be constructed using only blue and red blocks, and no other blocks.’). Hence, both people must work together to complete the task successfully.
To record high-resolution data of the study interactions, we developed a data-gathering system that incorporated multiple, synchronized streams of information about the study participants and their environment. For all the tasks, we tracked the positions and orientations of the heads and hands of both participants, recorded video of both participants and tracked the positions and orientations of all the foam blocks with which the participants interacted. To identify the emphasis and de-emphasis cues provided by the teachers in these tasks, an important piece of ‘ground-truth’ information was exploited: for these tasks, some of the blocks were ‘good’ and others were ‘bad.’ In order to complete the task successfully, the teacher needed to encourage the learner to use some of the blocks in the construction of the figure and to steer clear of the other blocks.
We observed a wide range of embodied cues provided by the teachers in the interactions for these two tasks as well as a range of different teaching styles. Positive emphasis cues included simple hand gestures such as tapping, touching and pointing at blocks with the index finger. These cues were often accompanied by gaze targeting, or looking back and forth between the learner and the target blocks. Other positive gestures included head nodding, the ‘thumbs up’ gesture and even shrugging. Teachers nodded in accompaniment to their own pointing gestures, and also in response to actions taken by the learners. Negative cues included covering up blocks, holding blocks in place or maintaining prolonged contact despite the proximity of the learner's hands. Teachers would occasionally interrupt reaching motions directly by blocking the trajectory of the motion or even by touching or (rarely) lightly slapping the learner's hand. Other negative gestures included head shaking, finger or hand wagging, or the ‘thumbs down’ gesture.
However, by far the most important set of cues used related to block movement and the use of space. To emphasize blocks positively, teachers would move them towards the learner's body or hands, towards the centre of the table, or align them along the edge of the table closest to the learner. Conversely, to emphasize blocks negatively, teachers would move them away from the learner, away from the centre of the table, or line them up along the edge of the table closest to themselves. Teachers often devoted significant attention to clustering the blocks on the table, spatially grouping the bad blocks with other bad blocks and the good blocks with other good blocks. These spatial scaffolding cues were the most prevalent cues in the observed interactions (Breazeal & Berlin 2008).
To verify the prevalence and usefulness of these spatial scaffolding cues for a robot, we substituted our robot Leonardo for the role of the learner (Berlin et al. 2008). The robot's attention system was designed to pay attention to block movement towards and away from its body. In order to give the robot the ability to learn from these embodied cues, we developed a simple, Bayesian learning algorithm. The algorithm was designed to learn rules pertaining to the colour and shape of the foam blocks and maintained a set of classification functions that tracked the relative odds that the various block attributes were ‘good’ or ‘bad’ according to the teacher's secret constraints. Each time the robot observed a salient teaching cue, these classification functions were updated using the posterior probabilities presented in the previous section—the odds of the target block being ‘good’ or ‘bad’ given the observed cue. For example, if the teacher moved a green triangle away from the robot, the relative odds of green and triangular being good block attributes would decrease. Similarly, if the teacher then moved a red triangle towards the robot, the odds of red and triangular being good would increase.
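The flavour of this update can be sketched as an additive log-odds filter over block attributes. The cue likelihood ratios below are illustrative placeholders rather than the values used in the actual system:

```python
import math

# Log likelihood ratio per cue type:
# log P(cue | attribute good) - log P(cue | attribute bad).
# The magnitudes here are illustrative, not the published values.
CUE_LLR = {
    "moved_towards": math.log(3.0),
    "moved_away": math.log(1.0 / 3.0),
}

# attribute -> log-odds that it is "good" (prior of 0.0 means even odds)
log_odds = {}

def observe(cue, block_attributes):
    """Shift the log-odds of every attribute of the cued block."""
    for attr in block_attributes:
        log_odds[attr] = log_odds.get(attr, 0.0) + CUE_LLR[cue]

# The green/red triangle example from the text:
observe("moved_away", ["green", "triangle"])      # both become less likely good
observe("moved_towards", ["red", "triangle"])     # red rises; triangle recovers
```

A ‘moved towards’ cue raises the odds for every attribute of the target block and a ‘moved away’ cue lowers them, so conflicting evidence about an attribute (here, ‘triangle’) cancels while consistent evidence accumulates.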
These simple spatial scaffolding cues proved to be highly effective. We invited 18 participants to teach Leonardo the same secret constraint tasks as our human learners. The robot successfully learned the task in 33 of the 36 interactions (92%). These results support the conclusion that the spatial scaffolding cues observed in human–human teaching interactions do indeed transfer to HRIs, and can be effectively exploited by robot learners (Berlin et al. 2008).
(iv) Summary: social filters
Whereas traditional approaches to teaching robots do not model social–cognitive skills and abilities as integral to the learning process, this body of work has identified and verified a number of ways in which internal and external social factors shape how a robot learner filters the incoming perceptual stream to attend to what matters. It has also shown that human teachers bring many of these same social cues and skills to bear when teaching either humans or robots, and that these ‘social filters’ can be effectively used by a robot to identify the most relevant items to consider, thereby making the learning problem significantly more manageable.
(c) Challenge 3
Once the robot has identified salient aspects of the scene, how does it determine what actions it should take? If the robot had a way of focusing on potentially successful actions, its exploration would be more effective. This challenge can be addressed in a number of ways, such as by having the robot experiment on its own, as in reinforcement learning (RL). For large state–action spaces, however, this typically requires a prohibitively large number of trials.
(i) Example: tutelage-style interaction
To address this issue, we have explored how social skills such as turn-taking enable a human teacher to play an important and flexible role in guiding the robot's exploration. This focuses the robot's selection of the most promising actions in specific contexts to discover solutions more quickly. By participating in a ‘dialogue’ of demonstration followed by feedback and refinement, the human helps the robot to determine what action to try through a communicative and iterative process. We evaluated this approach by comparing it to learning the same task using traditional RL and achieved significant improvements in efficiency without loss of accuracy and with decreased sensitivity to noise (e.g. errors introduced by miscommunication are quickly repaired, which leads to greater robustness) (Breazeal et al. 2004).
(ii) Example: socially guided exploration
Unfortunately, a common limitation of human teachable robots is that the robot only learns when being explicitly taught. Personal robots, however, will need to learn while ‘on the job’ even when a person is not present or willing to teach them. To address this, we have developed and evaluated a learning system whereby learning opportunities for the robot's hierarchical RL mechanism arise from a combination of intrinsically motivated self-exploration and social scaffolding provided by a human teacher, such as suggesting actions for the robot to try, drawing the robot's attention to relevant contexts and highlighting interesting outcomes (Breazeal & Thomaz 2008b). We have systematically identified and verified our set of social scaffolding mechanisms through a series of HRI studies in which a human teacher guides Leonardo's exploration as it discovers a set of behaviours (e.g. opening or closing, playing music, changing the colours of lights) of a ‘smart box’ through pressing buttons, pushing levers and sliding toggles. Over time, Leonardo learns a set of task policies for bringing about each of these behaviours from different starting conditions to ‘master’ the ‘smart box’. We analysed the learning performance of the robot both with and without human teachers and found that learning via self-exploration is slower but more serendipitous, resulting in a broader task suite, whereas learning with a human teacher is more efficient and robust but tends to result in a smaller, more specialized task suite that reflects what the person wanted the robot to learn (Breazeal & Thomaz 2008a,b).
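One way to combine the two learning styles is to let teacher input override an otherwise novelty-driven choice. The sketch below is a simplified stand-in for the robot's actual exploration mechanism; the action names and novelty scores are invented for illustration:

```python
import random

def select_action(actions, novelty, suggestion=None):
    """Defer to the teacher's suggested action when one is offered;
    otherwise explore, weighting actions by novelty (a stand-in for
    intrinsic motivation)."""
    if suggestion is not None and suggestion in actions:
        return suggestion
    weights = [novelty.get(a, 1.0) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Actions on a hypothetical 'smart box':
actions = ["press_button", "push_lever", "slide_toggle"]
novelty = {"press_button": 0.2, "push_lever": 0.9, "slide_toggle": 0.9}

guided = select_action(actions, novelty, suggestion="press_button")
explored = select_action(actions, novelty)  # novelty-weighted random choice
```

When a teacher is present, suggestions steer the robot directly towards the actions the person wants it to master; when the robot is alone, the novelty weighting keeps exploration broad, matching the slower but more serendipitous self-exploration observed in the studies.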
(iii) Summary: intrinsically motivated but guidable learning
Personal robots will need to adapt their learning style to suit the dynamics of a changing learning environment. Sometimes the robot will have to explore on its own, while at other times a teacher might be present to help guide the robot's exploration. Through our studies, we have found that each style of learning has its respective advantages and produces learning products that are synergistic. For instance, what is learned more slowly but serendipitously through intrinsically motivated exploration yields a broader task suite that can come in handy at a later date—especially when the robot encounters a human teacher who helps the robot to rapidly hone and build on its growing skill set through socially guided exploration. Importantly, the mechanisms by which the robot's learning can be guided by the human should be informed by how people are naturally inclined to teach robots.
(d) Challenge 4
Once the robot attempts to perform an action, how can it determine whether it has been successful? How does it assign credit for that success? Further, if the robot has been unsuccessful, how does it determine which parts of its performance were inadequate? It is important that the robot be able to diagnose its errors in order to improve performance.
(i) Example: multi-modal feedback
To address this challenge, our approach recognizes that the teacher can readily help the robot do this, given that he or she has a good understanding of the task and knows how to evaluate the robot's success and progress. One way in which a human facilitates a learner's evaluation process is by providing feedback through various communication channels. For instance, we demonstrated the capability of a robot to interpret and appropriately respond to the affective intent in human speech, such as praising or scolding tones of voice (Breazeal & Aryananda 2002). In HRI studies, we showed that people refer to the robot's expressive cues to confirm that the robot understood both their meaning and the strength of their affective intent. We have applied verbal feedback in teaching scenarios to help the robot correct its task model as soon as mistakes are made. Furthermore, the robot provides the human with communicative feedback so that misunderstandings can be detected quickly. Both forms of feedback help to prevent errors from persisting for multiple steps, which would make them more awkward to correct later on. In recent HRI studies, our data suggest that these various forms of feedback contribute to a more fluid, efficient, accurate and robust teaching/learning interaction (Breazeal et al. 2004; Breazeal & Thomaz 2008a,b).
(ii) Example: guidance and understanding intent
Note that for any given feedback channel, it is important to understand what people are trying to communicate through it and how they are trying to make use of it. Our HRI studies with an interactive RL agent revealed that people use the reward signal not only to provide feedback on past actions (as is commonly assumed in the design of RL algorithms) but also to guide future action (Thomaz & Breazeal 2008). Further, we discovered a strong bias towards positive over negative feedback over the entire duration of the training, even in the beginning when the agent was doing many things wrong (Thomaz & Breazeal 2008). This suggests that people were using the feedback channel to motivate and encourage the robot. In short, people were naturally inclined to use the reward signal in many ways that the traditional RL framework was not designed to handle. Given our findings, we were then able to adapt the RL agent algorithm and teaching interface to accommodate how and what people were trying to communicate to the learner. As a result, our modified RL agent learned much more efficiently and robustly in a subsequent series of HRI experiments (Thomaz & Breazeal 2008).
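The spirit of the modification can be illustrated by splitting the human signal into two explicit channels. The class below is a minimal sketch under that assumption (the names, update rule and learning rate are illustrative, not the published algorithm): reward credits the action just taken, while guidance restricts which actions are considered next.

```python
from collections import namedtuple

Action = namedtuple("Action", ["name", "obj"])

class InteractiveLearner:
    """Treats the human signal as two channels: feedback that credits the
    past action, and guidance that biases future action selection."""

    def __init__(self, alpha=0.3):
        self.q = {}           # (state, action) -> learned value
        self.alpha = alpha    # illustrative learning rate
        self.guided_obj = None

    def human_signal(self, state, last_action, reward=None, guide_obj=None):
        if reward is not None:            # feedback channel: credit the past
            key = (state, last_action)
            old = self.q.get(key, 0.0)
            self.q[key] = old + self.alpha * (reward - old)
        if guide_obj is not None:         # guidance channel: shape the future
            self.guided_obj = guide_obj

    def choose(self, state, actions):
        # Restrict attention to actions on the guided object, if any.
        pool = [a for a in actions if self.guided_obj is None
                or a.obj == self.guided_obj]
        pool = pool or list(actions)
        return max(pool, key=lambda a: self.q.get((state, a), 0.0))

learner = InteractiveLearner()
press = Action("press_button", "button")
push = Action("push_lever", "lever")
learner.human_signal("s0", press, reward=1.0)         # praise the last action
learner.human_signal("s0", press, guide_obj="lever")  # point at the lever
chosen = learner.choose("s0", [press, push])
```

A conventional RL agent would fold both intents into a single scalar reward; separating them lets encouragement raise a value estimate without dragging the agent back to an already-mastered action.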
(iii) Summary: transparency
While traditional approaches to robot training do not consider how a robot can proactively communicate and reveal its learning process to the human teacher, the findings generated by this body of work argue for the importance of transparency in designing interactive robot learners. People are willing and able to help robots address the difficult task of assigning value to their past actions. People are also willing to help guide the robot to select good future actions, to motivate the robot and more. However, human teachers cannot do this well if they lack a good mental model of the robot's learning process or if they are not provided with the right set of communication channels. The robot's behaviour, both its expressive cues and instrumental actions, can play a significant role in shaping the mental model that the human has of the robot. These readily observable expressive and performance-based cues make the robot's learning process transparent to the teacher. Much of our work to date has emphasized the role of the robot's non-verbal cues, such as facial expressions, gestures and use of gaze, in supporting this process. Conversely, our HRI studies have helped us to identify what kinds of intents people want to communicate to the robot through both verbal and non-verbal channels, for instance to help the robot learn by influencing its evaluation process.
While it might be tempting to compare our outcomes with those of statistical machine learning techniques, my research vision and the challenges I wish to solve are ultimately different. My students and I have built and evaluated autonomous robotic systems that leverage the interplay of social guidance and statistical inference algorithms to learn new tasks and concepts from humans through natural social interactions. For task learning, our robots are able to quickly infer the critical preconditions and desired outcome for each step of the learned task, as well as how these steps relate to one another in the overall task structure, with improved efficiency and robustness to noise and without loss of accuracy relative to traditional statistical machine learning methods (e.g. traditional RL). For concept learning, our robots are able to learn the correct concept from natural interactions by exploiting natural scaffolding cues, such as how the teacher uses space to highlight the concept to be learned, or by applying socio-cognitive skills to consider the teacher's perspective in order to learn the appropriate concept in the face of ambiguous demonstrations. The underlying machine learning algorithm can be simple because the robot appropriately leverages the social structure inherent in the teacher's behaviour or the modified workspace to attend to what matters and learn the right thing. Furthermore, the same social cues can be repurposed to support other social capabilities, such as multi-modal communication and human–robot teamwork.
To conclude, the field of social robotics is very young but growing rapidly, motivated by the vision of personal robots that help anyone in their daily activities. My dream is to enable machines to engage in the powerful, social forms of interaction, collaboration, understanding and learning that people readily participate in. This vision is motivated by the observation that humans are ready-made experts in social interaction; the challenge is to design robots to participate in what comes naturally to people. By doing so, socially interactive robots could help a wide demographic of people in a broad range of applications and real-world challenges spanning health, therapy, education, communication, security, entertainment and physical assistance. In this article, I have tried to illustrate the myriad ways in which designing social robots that successfully interact with and learn from ordinary people presents new challenges and opportunities, and have highlighted some of the key lessons and findings learned along the way. We live in an exciting time where so much is possible at the intersection of science and technology. Social robots promise to be not only helpful to us in the future but also a lot of fun. And in the process of building them, we may learn even more about ourselves.
One contribution of 17 to a Discussion Meeting Issue ‘Computation of emotions in man and machines’.
- © 2009 The Royal Society