AI-HRI 2019 — 2019 AAAI Fall Symposium on The Artificial Intelligence for Human-Robot Interaction (AI-HRI)

Towards A Robot Explanation System: A Survey and Our Approach to State Summarization, Storage and Querying, and Human Interface

Zhao Han, Jordan Allspaw, Adam Norton and Holly Yanco

Full paper, 8 pages
The workflow of the robot explanation system.
  • Aug 19, 2019

    Our paper is accepted to The AAAI Fall Symposium on The Artificial Intelligence for Human-Robot Interaction (AI-HRI '19)! This is an important step towards a robot explanation system. We have already started implementing it.


As robot systems become more ubiquitous, developing understandable robot systems becomes increasingly important in order to build trust. In this paper, we present an approach to developing a holistic robot explanation system, which consists of three interconnected components: state summarization, storage and querying, and human interface. To find trends towards and gaps in the development of such an integrated system, a literature review was performed and categorized around those three components, with a focus on robotics applications. After the review of each component, we discuss our proposed approach for robot explanation. Finally, we summarize the system as a whole and review its functionality.


With the advancement and wide adoption of deep learning techniques, explainability of software systems and interpretability of machine learning models have attracted both human-computer interaction (HCI) researchers (e.g., (Abdul et al. 2018)) and the artificial intelligence (AI) community (e.g., (Miller 2019)). Work in human-robot interaction (HRI) has shown that improving understanding of a robot makes it more trustworthy (Desai et al. 2013) and more efficient (Admoni et al. 2016). However, how robots can explain themselves at a holistic level (i.e., generate explanations and communicate them, with a supporting data storage system with efficient querying) remains an open research question.

As opposed to virtual AI agents or computer software, robots have physical embodiment, which influences metrics such as empathy (Seo et al. 2015) and cooperation (Bainbridge et al. 2008) with humans. Given this embodiment, some research in human-agent interaction is not applicable to human-robot interaction. For example, in a literature review about explainable agents and robots (Anjomshoae et al. 2019), approximately half (47%) of the explanation systems examined used text-based communication methods, which are less relevant for robots that are not usually equipped with display screens. Instead, HRI researchers have been exploring non-verbal physical behavior such as arm movement (Dragan, Lee, and Srinivasa 2013; Kwon, Huang, and Dragan 2018) and eye gaze (Moon et al. 2014). Non-verbal behaviors can help people to anticipate a robot’s actions (Lasota et al. 2017), but understanding why that behavior occurred can improve one’s prediction of behaviors, especially if the behavior is opaque (Malle 2006). Thus, robot explanations of their own behavior are needed.

In this paper, a holistic robot explanation system is decomposed into three components (see Figure 1), the research literature is surveyed to explore trends and gaps, and a proposed design philosophy and approach is detailed as informed by the results of the literature review. We aim to provide important considerations and directions towards a robot explanation system that can accelerate robot acceptance. The term “summarization” addresses the process of shortening the description of the robot’s activities while “explanation” strives to give insight into why the robot performed the summarized behaviors.

1.1 System Components

Figure 1: High-level representation of the robot explanation system’s three components. Fig. 2 shows a detailed version.

A robot explanation system requires state summarization, data storage and querying, and a human interface.

State summarization is at the core of the system, manually or automatically generating varying levels of summaries from different robot states while performing tasks or from the stored states in a post-hoc fashion. The varying summary levels allow people to receive explanations ranging from more abstract to more detailed (Brooks et al. 2010) (e.g., processed data compared to raw sensor data, respectively). Explanations that utilize raw sensor data will likely only be favored by the minority of expert users while explanations involving processed data will be useful for both expert users and the majority of non-expert users.

A persistent storage system is needed to retain robot data and generated explanations. This system will pass the generated explanations, or summaries, to the human interface to be communicated. While storing them, different levels of explanations stemming from the same instance of source data need to be linked to maintain fluid interactions with a person. The person may request follow up explanations with more or less detail than the initial explanation. The storage system must also have a query component as part of the database interface to support online state summarization, which is needed when the stored summaries are not sufficient to answer users’ questions. Querying must also be efficient, given the potentially large amount of robot data being stored.

The human interface component communicates the explanations from the robot to the human and allows the person to ask the system questions. The human interface can use several different modalities, such as natural language dialogues, a traditional graphic user interface (GUI) on a display screen or virtual reality (VR), and augmented reality (AR) that directly projects onto the robot’s environment. The communication method could also involve moving the robot system, such as moving the robot’s head or arm.

1.2 Scope and Contributions of the Work

This paper surveys the literature about the three components of an explanation system to provide critique, summarize trends, and discover gaps towards the design of a robot explanation system. By leveraging the trends observed in the research literature, we propose a robot explanation system architecture that will aim to fill the discovered gaps.

The literature review was focused on research involving robots, rather than more general AI agents or the interpretation of machine learning models. This constraint was relaxed when work with robots was underrepresented or shared commonalities in one component, such as state summarization where abstract states both apply to AI agents and robots.

Thus, the literature review is not meant to be exhaustive, but comprehensive under these constraints to cover the three components in the context of physical co-located robot systems. We refer interested readers to (Abdul et al. 2018) for explainability of computer software, (Miller 2019; Adadi and Berrada 2018) for the explainability of virtual AI agents, and (Zhang and Zhu 2018; Guidotti et al. 2018) for interpretability of machine learning/deep learning models.

State Summarization

Before a robot can begin to explain its actions, it must first translate its decisions in a manner that could be understood by a human. Significant research has been performed within the fields of HCI and HRI towards this goal. In this section, we discuss systems in the literature that are deployable to physical robots. There has been significant research in the field of explainable AI, sometimes referred to as XAI (Wang et al. 2017); however, this is beyond our scope. We are specifically interested in summarization methods that can work for a variety of different systems.

2.1 Manual Methods

There are two components of state summarization: the state of which the robot is aware and the state that the robot communicates to the user. A common approach is for developers to manually create categories by which the robot can explain its actions. For example, programmer-specified function annotations for each designated robot action are used in (Hayes and Shah 2017). By creating a set of robot actions, correlated with code functions, the system is able to snapshot the state of the robot before and after a function is called. Since the state of the robot could be exceedingly large in a real-world, deployed system, the state space is shrunk by isolating the variables that are predetermined to be most relevant. These annotated variables are recorded every time a pre- and post-action snapshot is made. The robot then uses inspection to compare the pre- and post-variables of one action against those of other similar successful actions to make judgments.
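
The snapshot idea can be illustrated with a minimal Python sketch. It is not the implementation from (Hayes and Shah 2017); the decorator `annotated_action`, the `robot_state` dictionary, and the variable names are all hypothetical and serve only to show how a developer-chosen subset of variables could be recorded before and after an action.

```python
import copy
import functools

# Hypothetical robot state; a real system would hold far more variables.
robot_state = {"gripper_open": True, "holding": None, "battery": 0.93}
snapshot_log = []


def annotated_action(name, relevant_vars):
    """Record only the annotated variables before and after the wrapped action."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pre = {k: copy.deepcopy(robot_state[k]) for k in relevant_vars}
            result = func(*args, **kwargs)
            post = {k: copy.deepcopy(robot_state[k]) for k in relevant_vars}
            snapshot_log.append({"action": name, "pre": pre, "post": post})
            return result
        return wrapper
    return decorator


@annotated_action("grasp", relevant_vars=["gripper_open", "holding"])
def grasp(obj):
    # Stand-in for the real motion; it only mutates the toy state.
    robot_state["gripper_open"] = False
    robot_state["holding"] = obj


def changed_variables(entry):
    """Return the annotated variables whose values differ across the action."""
    return {k: (entry["pre"][k], entry["post"][k])
            for k in entry["pre"] if entry["pre"][k] != entry["post"][k]}


grasp("cup")
print(changed_variables(snapshot_log[-1]))
```

Restricting the snapshot to `relevant_vars` mirrors the state-space reduction described above: `battery` is never recorded for the grasp action because the developer did not annotate it as relevant.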

A different approach, suggested in (Kaptein et al. 2017), is to adapt hierarchical task analysis (Schraagen, Chipman, and Shalin 2000) to a goal hierarchy tree (GHT). This involves creating a tree where the top node would be a high level task, which can be broken into a number of sub-goals, each linked by a belief (i.e., condition). Each sub-goal can then be broken into either sub-goals or actions. Choosing one sub-goal or action over another is based on a belief. The GHT can then be used to generate explanations. When comparing goal based vs. belief based explanations, Kaptein et al. found that adults significantly preferred goal based explanations.
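
As a rough illustration of how a goal hierarchy tree could drive the two explanation styles, the sketch below builds a tiny tree and generates a goal-based or belief-based sentence for an action. The `Node` class, the sentence templates, and the coffee-serving example are assumptions made for illustration, not the representation used by Kaptein et al.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str                       # goal or action, phrased as a verb phrase
    belief: Optional[str] = None    # condition under which this node is chosen
    children: List["Node"] = field(default_factory=list)


def path_to(node, target, trail=()):
    """Depth-first search; returns the nodes from the root down to `target`."""
    trail = trail + (node,)
    if node.name == target:
        return trail
    for child in node.children:
        found = path_to(child, target, trail)
        if found:
            return found
    return None


def explain(root, action, style="goal"):
    path = path_to(root, action)
    if path is None or len(path) < 2:
        return f"I have no explanation for '{action}'."
    node, parent = path[-1], path[-2]
    if style == "belief" and node.belief:
        # Belief-based: cite the condition that selected this node.
        return f"I {node.name} because {node.belief}."
    # Goal-based: cite the parent goal the node contributes to.
    return f"I {node.name} because I want to {parent.name}."


tree = Node("serve coffee", children=[
    Node("get a cup", belief="no cup is on the table", children=[
        Node("open the cupboard", belief="the cups are stored in the cupboard"),
    ]),
])

print(explain(tree, "open the cupboard", style="goal"))
print(explain(tree, "open the cupboard", style="belief"))
```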

2.2 Summarization Algorithms

While manually creating categories or explanations can be effective, it is time consuming and not easily generalizable. Many techniques attempt to automate the process.

Programmer-supplied explanations might be able to accurately describe the state of a robot; however, they can prove to be inadequate for a user. Ehsan et al. state that it is best to use a rationale justification (2019) to explain to non-expert users, differentiating between a rationale and an explanation. An explanation can be made by exposing the inner workings of a system, but this type of explanation may not be understandable to non-experts. They suggest the alternative, a rationale, is meant to be an accessible and intuitive way of describing what the robot is doing. They also discuss how explanations can be tailored to optimize for different factors, including relatability, intelligibility, contextual accuracy, awareness and strategic detail; these factors can affect the user’s confidence, the understandability of the explanations, and how human-like the explanation was. The approach does not attempt to provide an explanation that reveals the underlying algorithm, but rather attempts to justify an action based on how a non-developer bystander would think. The authors explore two different explanation strategies: “focused view rationale” provides concise and localized rationale, which is more intelligible and easier to understand, whereas “complete view rationale” provides detailed and holistic rationale, which has better strategic detail and increased awareness.

Haidarian et al. proposed a metacognitive loop (MCL) architecture with a generalized metacognition module that monitors and controls the performance of the system (2010). Every decision performed by the system has a set of expectations and a set of corrections or corrective responses. Their framework does not attempt to monitor and respond to specific expectation failures which would require intricate knowledge of how the world works. However, the abandonment of intricate knowledge makes it difficult to provide specialized, highly detailed explanations to an expert operator.

Most of this prior work examined explanations within rule-based and logic-based AI systems, not addressing the quantitative nature of much of the AI used in HRI. More recent work on automatic explanations instead used Partially Observable Markov Decision Problems (POMDPs), which have seen success in several situations within robotics (Wang, Pynadath, and Hill 2016). Unfortunately, the quantitative nature of these models and the complexity of their solution algorithms also make POMDP-based reasoning opaque to people. Wang, Pynadath, and Hill propose an approach to automatically generate natural-language explanations for POMDP-based reasoning, with predefined string representations of the potential actions, accompanied by the level of uncertainty and the relative likelihood of outcomes. The system could also reveal information about its sensing abilities along with how accurate its sensor is likely to be. However, modeling using POMDPs can be time consuming.

Miller discusses how explanations delivered to the user should be generated based on data from social and behavioral research, which could increase user understandability (2019). Whether the explanation is generated from expert developers or from a large dataset of novice operators, both cases still require manually tying the robot algorithm to an explanation, a process that can be difficult and faulty.

In their literature review, Anjomshoae et al. conclude that context-awareness and personalization remain under-researched despite having been determined to be key factors in explainable agency (2019). They also suggest that multi-modal explanation presentation is possibly useful, which would mean the underlying state representation would need to be robust enough to handle several different approaches. Finally, they propose that a robot should keep track of a user’s knowledge, with the explanation generation model updated to reflect the evolution of user expertise.

2.3 Proposed Approach

Much of the prior research focuses on scenarios where the system is designed for either novice or expert users but not both. De Graaf and Malle argue that a robot should take a person’s knowledge and role into account when formulating a response (2017). While context-awareness and personalization have been outlined as key factors for effective state summarization, there is little research where the robot adjusts its explanation depending on the user (Anjomshoae et al. 2019), even in simple cases such as expert vs. novice user.

There are many identified differences between expert users and novice users. For example, traceability and verification are very important for software and hardware engineers (Cleland-Huang et al. 2012) while explainability or intelligibility are particularly important for laymen (De Graaf and Malle 2017). Cases where a user starts as a novice then gains experience over time are under-researched. In our system, the state representation and explanation generation need to be able to adjust to the user and possibly change depending on context. In addition, even if the user is taken into account, such systems fail to account for cases where a user could potentially want to have both levels of explanations available simultaneously or to switch between them. For example, the user could quickly receive a high-level explanation for why the robot performed an action, then, if that proves insufficient, inquire for more details. Ideally, the robot would be able to provide explanations with varying granularity and context, tailored to the experience level of the user (e.g., bystander, operator, programmer, etc.).

Given the conclusion that participants preferred annotations of the actions by a reinforcement learning game agent from developers (Ehsan et al. 2019), both manual and automated methods to generate explanations should be considered. Manually generated explanations fill the gap that new users do not have a thorough understanding of the logic in the underlying algorithm. However, when developers manually put explanations in code (e.g., using (Hayes and Shah 2017)), they should always consider the audience of new users and provide easy-to-understand explanations that are not tightly coupled with implementation details.

In order to cover a wide variety of possible situations, our proposed approach is for an expertly created inner state representation based on categories and goals with methods to automatically create desirable explanations delivered to the user, where those explanations can adjust based on the user’s experience (bystander, novice, expert) as well as other factors. The system should isolate and convey necessary context for a decision or state, or be prepared to provide it if additional information is requested. Specifically, if confidence in these generated responses is low, the state summarization algorithms can fall back to the expertly created explanations.

For example, the sample task of picking up an object could be decomposed into subtasks: locate the object, navigate to it, then grasp the located object. Each of those subtasks can then be broken down into subtasks of their own. The last step of grasping can be decomposed into reaching, grabbing, and retreating the arm back to its home location. Eventually the subtasks end up in robot primitives, the simplest actions the system can describe. Each of these actions can have a failure reason, along with context, which would include sensor data and prior relevant state information. If the motion planner fails to find a valid solution, the error propagates up and the action of “grasping an object” failed because the subtask “grabbing the object” failed when “reach for the object” failed as a result of no valid inverse kinematics solution being found. The system needs to correctly determine the most relevant and useful failure level to report; for example, in this case, an expert operator would be told, “No valid inverse kinematics solution was found” while a bystander would be told “I could not reach the object.”
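
The following Python sketch illustrates one possible way to propagate a failure up such a task tree and to pick the wording by audience. The `Task` class, the two audience labels, and the sentence template are hypothetical; the tree follows the decomposition given above (grasping contains reaching, grabbing, and retreating), and selecting the failure level is reduced here to a simple rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    subtasks: List["Task"] = field(default_factory=list)
    failure: Optional[str] = None   # technical reason, set on the failing primitive


def failure_chain(task):
    """Return the failed tasks from `task` down to the failing primitive."""
    if task.failure is not None:
        return [task]
    for sub in task.subtasks:
        chain = failure_chain(sub)
        if chain:
            return [task] + chain
    return []


def explain_failure(root, audience):
    chain = failure_chain(root)
    if not chain:
        return "The task succeeded."
    leaf = chain[-1]
    if audience == "expert":
        # Experts receive the low-level technical reason.
        return leaf.failure
    # Other audiences receive the failed primitive in plain words.
    return f"I could not {leaf.name}."


pick_up = Task("pick up the object", subtasks=[
    Task("locate the object"),
    Task("navigate to the object"),
    Task("grasp the object", subtasks=[
        Task("reach the object",
             failure="No valid inverse kinematics solution was found"),
        Task("grab the object"),
        Task("retreat the arm"),
    ]),
])

print(explain_failure(pick_up, "expert"))
print(explain_failure(pick_up, "bystander"))
print(" > ".join(t.name for t in failure_chain(pick_up)))
```

Keeping the whole chain, rather than only the leaf, is what would allow a follow-up request for more or less detail to be answered from the same failure instance.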

Storage and Querying

Terminal output or logs are common methods for debugging during active development and for error analysis after a robot has been deployed, but both methods have some drawbacks. Terminal output is essentially volatile memory, lost after the terminal window is closed, disallowing retrospection. Software logs, despite being persistent on disk, are unstructured and do not link related data, which makes them hard to query effectively and efficiently. Thus, researchers have been exploring database techniques to better store and query robotic data. Because storage and querying are under-discussed in the robotic community, this section is more detailed than the other two components.

3.1 Storing Unprocessed Data

Many researchers have been leveraging the schemaless MongoDB database to store unprocessed data from sensors or communication messages from lower-level middleware such as motion planners (Niemueller, Lakemeyer, and Srinivasa 2012; Beetz, Tenorth, and Winkler 2015). Being schemaless allows for recording different hierarchical data messages without declaring the hierarchy in the database (i.e., tables in relational databases such as MySQL). One such hierarchical example is the popular Pose message type present in the Robot Operating System (ROS) framework (Quigley et al. 2009). A Pose message contains a position Point message and an orientation Quaternion message; a Point message contains float values x, y, and z; a Quaternion message is represented by x, y, z, and w. In a relational database, one would have to go through the cumbersome process of creating tables for Pose, Point, and Quaternion, and even more tables would have to be created for each additional hierarchical data message. This advantage is also described as minimal configuration and allows evolving data structures to support innovation and development (Niemueller, Lakemeyer, and Srinivasa 2012).
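
To make the hierarchy concrete, the sketch below writes a Pose as a nested document of the kind a schemaless store accepts directly, and lists the nested sub-documents that would each need a table in a relational schema. The topic name, timestamp, and numeric values are made up, and the helper `nested_types` is only illustrative.

```python
import json

# A Pose stored as one nested document; no schema has to be declared first.
pose_document = {
    "topic": "/hypothetical/object_pose",
    "stamp": 1566230400.0,
    "pose": {
        "position": {"x": 0.42, "y": -0.10, "z": 0.85},
        "orientation": {"x": 0.0, "y": 0.0, "z": 0.0, "w": 1.0},
    },
}


def nested_types(doc):
    """List every nested sub-document; each would need its own relational table."""
    found = []
    for key, value in doc.items():
        if isinstance(value, dict):
            found.append(key)
            found.extend(nested_types(value))
    return found


print(json.dumps(pose_document, indent=2))
print(nested_types(pose_document))
```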

Niemueller, Lakemeyer and Srinivasa open-sourced the mongodb_log library and are among the first to introduce MongoDB to robotics for logging purposes, which has applications to fault analysis and performance evaluation (Niemueller, Lakemeyer, and Srinivasa 2012). In addition to being schemaless, the features that support scalability, such as capped collections, indexing and replication, are discussed. Capped collections handle limited storage capacity by replacing old records with new ones. Indexing on a field or a combination of fields speeds up querying. Replication allows storing data across computers using the distributed paradigm. Note that the indexing and replication features are also supported by relational databases.

While low-level data is needed, recording all raw data will soon hit the storage capacity limit: when old data is replaced by new records, the important information in the old data will be lost. This is particularly true when the data comes at a high rate; e.g., a HERB robot generates 0.1 GB per minute typically and 0.5 GB at peak times (Niemueller, Lakemeyer, and Srinivasa 2012). A more effective way is to be selective, only storing the data of interest (Oliveira et al. 2014). However, storing raw sensor data only facilitates debugging for developers; it does not solve the high-level explanation storage that will help non-expert users to understand the robot.
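
The trade-off of a capped collection can be shown with a small in-memory stand-in. The sketch below does not use MongoDB; the `CappedLog` class is a hypothetical illustration built on `collections.deque`, which discards the oldest record once the capacity is reached, so a later query over an early time range returns nothing.

```python
from collections import deque


class CappedLog:
    """In-memory stand-in for a capped collection: old records are overwritten."""

    def __init__(self, max_records):
        self.records = deque(maxlen=max_records)

    def insert(self, record):
        # When the deque is full, appending silently drops the oldest record.
        self.records.append(record)

    def find(self, start, end):
        """Return the surviving records whose timestamp lies in [start, end]."""
        return [r for r in self.records if start <= r["stamp"] <= end]


log = CappedLog(max_records=3)
for t in range(5):
    log.insert({"stamp": float(t), "topic": "/hypothetical/scan"})

print(log.find(0.0, 1.0))                          # early records were replaced
print([r["stamp"] for r in log.find(2.0, 4.0)])    # only the newest three remain
```

For a sense of scale, at the typical rate of 0.1 GB per minute quoted above, a disk of 500 GB (an assumed size, chosen only for the arithmetic) would fill in 5,000 minutes, or roughly 83 hours; at the peak rate of 0.5 GB per minute it would fill in about 17 hours.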

In addition, while it might be appropriate to expose the database to developers, a more effective way may be an interface that hides the database complexity, easing the cognitive burden on developers. This could be programming language agnostic, for example, by having a HTTP REST API.

Other researchers have also used MongoDB to store low-level data (Beetz, Mösenlechner, and Tenorth 2010; Niemueller et al. 2013; Winkler et al. 2014; Balint-Benczédi et al. 2017), except for Oliveira et al. (2014), who used LevelDB, a key-value database, for perceived object data. Ravichandran et al. benchmarked major types of databases and found that, on average, MongoDB has the best performance for continuous robotic data (2018). However, time-series and key-value databases are not included in the benchmark.

3.2 Storing Processed Data

Instead of looking for related data using the universal time range, Balint-Benczédi et al. proposed Common Analysis Structure to store linked data for manipulation tasks (2017). The structure includes timestamp, scene, image, and camera information. A scene has a viewpoint coordinate frame, annotations, and object hypotheses. Annotations are supporting planes or a semantic location, and object hypotheses are regions of raw data and their respective annotations. The authors considered storage space constraints, thus filtering and storing only regions of interest in unblurred images or point clouds. In their follow-up work (Durner et al. 2017), the Common Analysis Structure is used to optimize perception parameters by users providing ground truth labels.

Similarly, Oliveira et al. proposed a perception database using LevelDB to enable object category learning from users (2014). Instead of regions of raw point cloud data, user mediated key views of the same object are stored linking to one object category.

Wang et al. utilized a relational database as cloud robotics storage so multiple low-end robots can retrieve 3D laser scan data from a high-end robot, which has a laser sensor and processes its data onboard with more storage and better computation power (2012). Specifically, low-end robots can send a query with their poses on a map to retrieve 3D map data and image data. PostgreSQL is used but the data structure detail is not discussed, as the paper focuses on resource allocation and scheduling. However, a local data buffer on robots is proposed to store frequently accessed data to reduce the database access latency bottleneck.

Dietrich et al. used Cassandra to store and query 2D and 3D map data with spatial context such as building, floor, and room (2014). There are several benefits of using Cassandra, such as the ability to have a local server that can query both local data and remote data, avoiding a single point of failure. Developers can also define TTLs (Time to Live) to remove data automatically, avoiding a maintenance burden.

In addition, Fourie et al. leveraged a graph database, Neo4j, to link vision sensor data stored in MongoDB to pose-keyed data (2017). Graph databases allow complex queries with spatial context for multiple mapping mobile robots, which enables multi-robot mapping. This line of research focuses on storing processed data but did not discuss a way to link raw data back to the processed data. This is important because not storing linked raw data may lead to loss of information during retrospection. There is a trend that other types of database systems, e.g., relational database (PostgreSQL) and key-value database (LevelDB and Cassandra) are used to store those processed data, because only a few ever-evolving data structures need to be stored.

3.3 Querying

There is no unified method for querying; most are application specific, such as efficient debugging (Niemueller, Lakemeyer, and Srinivasa 2012; Balint-Benczédi et al. 2017) and task representation (Beetz, Tenorth, and Winkler 2015; Tenorth et al. 2015). Interfaces are also tightly coupled to programming languages: JavaScript from MongoDB (Niemueller, Lakemeyer, and Srinivasa 2012), Prolog (Beetz, Tenorth, and Winkler 2015; Tenorth et al. 2015) and SQL (Dietrich, Zug, and Kaiser 2015).

In mongodb_log, Niemueller, Lakemeyer and Srinivasa proposed a knowledge hierarchy for manipulation tasks to enable efficient querying for debugging (2012). The knowledge hierarchy consists of all raw data and the poses of the robot and manipulated objects, all of which are timestamped. When a manipulation task fails, a top-down search is performed in the knowledge hierarchy in a specific time range. Poses are at the root of the hierarchy and raw data, such as coordinate frames and point cloud data, are replayed in a visualization tool for further investigation (i.e., Rviz in ROS). The query language is JavaScript using the MapReduce paradigm, which supports aggregation of data natively.

Beetz, Tenorth, and Winkler proposed Open-EASE, a web interface for robotic knowledge representation and processing for developers (Beetz, Tenorth, and Winkler 2015; Tenorth et al. 2015). Robotics and AI researchers are able to encapsulate manipulation tasks semantically as temporal events with sets of predefined semantic predicates. Manipulation episodes are logged by storing low-level data, which are the environment model, object detection results and poses, and planned tasks in an XML-based Web Ontology Language (OWL) (W3C 2009). High-velocity raw data such as sensor data and robot poses are logged in a schemaless MongoDB database. Querying uses Prolog with a predefined concept vocabulary, similar to the semantic predicates.

While Open-EASE allows semantic querying, it does not come easily. One disadvantage is the introduction of a different programming paradigm, logic programming in Prolog, which robot developers have to learn for querying regardless of the paradigm being used for robot programming. It is also unclear how to extend the pre-defined semantic predicates for other generic tasks in different environments.

Balint-Benczédi et al. use a similar high level description language to replace the JavaScript query feature in MongoDB (2017) to avoid the in-depth knowledge requirement of the internal data structure. The description language also contains predefined predicates and can be queried through Prolog. This work has the same drawbacks as Open-EASE.

Interestingly, Dietrich, Zug, and Kaiser proposed SelectScript, a SQL-inspired query-only language for robotic world models without having relevant tables in the database (2015). Without using a different programming language to specify how to retrieve data, SelectScript provides a declarative and language-agnostic way to specify what data are needed rather than how. SelectScript also features custom function support for queries and custom return types native to robotic applications, such as an occupancy grid map.

While SelectScript is modeled on the well-known standard SQL, it is not language-agnostic as stated. Custom functions are only supported in Python, leaving ROS C++ programmers behind. Besides requiring significant effort to support C++, it is not trivial in SelectScript to extend the return type to new data types such as the popular Octomap used in 3D mapping (Hornung et al. 2013) for obstacle avoidance.

Fourie et al. proposed to use a graph database to query spatial data from multiple mobile robots (2017). However, it is not suitable for our use given that only one relationship is used: odometry poses linked to image and RGBD data. This work also suffers from the same drawback as SelectScript in that custom queries have to be programmed in one specific language, here Java.

Similar to our argument in the previous storage sections, robotic database designers should embrace the programming languages with which robotic developers are already familiar. Database technology should be hidden by interfaces written in programming languages that also support access to the underlying database for advanced and customized use.

3.4 Proposed Approach

MongoDB’s use has been proven by robotic developers and should thus be chosen to store low-level sensor data, which is potentially large and high velocity. This is mainly to replace the rosbag utility, which relies on a filesystem and is not easy to query. To store summaries at different levels and to link them, a relational database should be chosen because it is specialized to store relational data. Additional columns will be used as references to the sensor data in MongoDB. When data are deleted in MongoDB or the relational database, the linked data should be deleted as well; this can be achieved by a background job system or enforced in the programming interface to be discussed below.
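
A minimal sketch of this linking and deletion rule is given below, under several assumptions: SQLite stands in for the relational database, a plain dictionary `raw_store` stands in for a MongoDB collection, and the table layout, column names, and document id `raw-001` are hypothetical. Summaries of different levels are linked through `parent_id`, and `raw_doc_id` is the additional column that references the raw sensor document.

```python
import sqlite3

# Stand-in for a MongoDB collection, keyed by a made-up document id.
raw_store = {
    "raw-001": {"topic": "/hypothetical/joint_states", "stamp": 10.0, "data": [0.1, 0.2]},
}

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE in SQLite
db.execute("""
    CREATE TABLE summary (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES summary(id) ON DELETE CASCADE,
        level INTEGER NOT NULL,      -- 0 = most abstract
        text TEXT NOT NULL,
        raw_doc_id TEXT              -- reference into the raw-data store
    )
""")

top_id = db.execute(
    "INSERT INTO summary (parent_id, level, text, raw_doc_id) VALUES (NULL, 0, ?, NULL)",
    ("I could not reach the object.",),
).lastrowid
db.execute(
    "INSERT INTO summary (parent_id, level, text, raw_doc_id) VALUES (?, 1, ?, ?)",
    (top_id, "No valid inverse kinematics solution was found.", "raw-001"),
)
db.commit()


def delete_summary(summary_id):
    """Delete a summary, its linked finer-grained summaries, and their raw documents."""
    rows = db.execute(
        """
        WITH RECURSIVE tree(id, raw_doc_id) AS (
            SELECT id, raw_doc_id FROM summary WHERE id = ?
            UNION ALL
            SELECT s.id, s.raw_doc_id FROM summary s JOIN tree t ON s.parent_id = t.id
        )
        SELECT raw_doc_id FROM tree WHERE raw_doc_id IS NOT NULL
        """,
        (summary_id,),
    ).fetchall()
    for (doc_id,) in rows:
        raw_store.pop(doc_id, None)          # remove the linked raw data first
    db.execute("DELETE FROM summary WHERE id = ?", (summary_id,))  # cascades to children
    db.commit()


print(db.execute("SELECT level, text FROM summary WHERE parent_id = ?", (top_id,)).fetchall())
```

Here the deletion rule is enforced in the programming interface; a background job, the other option named above, would instead periodically scan for summaries whose raw documents no longer exist.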


Instead of directly exposing the database to robot developers, the storage and querying interface should be written in the same programming language the developer is using for the robot system. The programming interface should be written in an object-oriented manner to allow easy extension (e.g., from a single robot to a cluster). A minimum subset of functional programming should also be used to support custom functions, similar to SelectScript (Dietrich, Zug, and Kaiser 2015). For common use cases, the determination of which data storage method to use should be handled by the interface so users do not need to be concerned with the underlying database being used. However, a raw interface that enables developers to directly communicate with the underlying database should not be completely left out, to allow for use cases that are not in consideration by interface designers. In terms of programming languages, C++ should be used, given its popularity among developers and most packages of the ROS framework. Python support should also be provided through binding the C++ implementation (that is, exposing a C++ class in Python).

Human interface developers should be able to query using indexes such as state summarization level, time range, and others specific to the domain. Custom queries should also be allowed by having interface functions exposing the database, as interface designers are not able to consider an exhaustive list of all use cases. Human interface developers should also be able to easily query linked follow-up data after the first query. This is essential for the human interface to provide interaction (e.g., interactive conversation or projection).
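
The interface requirements above can be sketched as a small facade class. This is an illustration only: the class name `ExplanationStore`, its method names, and the SQLite backend are assumptions, and Python is used for brevity even though the proposal is for a C++ implementation with Python bindings. The sketch shows querying by summarization level and time range, a custom filter function, a follow-up query for linked data, and a raw escape hatch to the underlying database.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, Optional[int], int, float, str]  # id, parent_id, level, stamp, text


class ExplanationStore:
    """Hypothetical facade: common queries need no knowledge of the database."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS summary ("
            "id INTEGER PRIMARY KEY, parent_id INTEGER, level INTEGER, "
            "stamp REAL, text TEXT)"
        )
        # Index the fields that human interface developers query on.
        self._db.execute(
            "CREATE INDEX IF NOT EXISTS idx_level_stamp ON summary(level, stamp)"
        )

    def add(self, level, stamp, text, parent_id=None) -> int:
        cur = self._db.execute(
            "INSERT INTO summary (parent_id, level, stamp, text) VALUES (?, ?, ?, ?)",
            (parent_id, level, stamp, text),
        )
        return cur.lastrowid

    def query(self, level, start, end,
              keep: Optional[Callable[[Row], bool]] = None) -> List[Row]:
        """Summaries at one level within [start, end], optionally filtered by `keep`."""
        rows = self._db.execute(
            "SELECT id, parent_id, level, stamp, text FROM summary "
            "WHERE level = ? AND stamp BETWEEN ? AND ? ORDER BY stamp",
            (level, start, end),
        ).fetchall()
        return [r for r in rows if keep is None or keep(r)]

    def follow_up(self, summary_id) -> List[Row]:
        """More detailed summaries linked to an earlier answer."""
        return self._db.execute(
            "SELECT id, parent_id, level, stamp, text FROM summary "
            "WHERE parent_id = ? ORDER BY id",
            (summary_id,),
        ).fetchall()

    def raw(self) -> sqlite3.Connection:
        """Escape hatch for use cases the interface designers did not foresee."""
        return self._db


store = ExplanationStore()
top = store.add(0, 10.0, "I could not reach the object.")
store.add(1, 10.0, "No valid inverse kinematics solution was found.", parent_id=top)
store.add(0, 99.0, "I picked up the cup.")

first = store.query(level=0, start=0.0, end=20.0)
details = store.follow_up(first[0][0])
print(first[0][4])
print(details[0][4])
```

Passing a function as `keep` is the small functional element: a caller could, for example, use `store.query(0, 0.0, 100.0, keep=lambda r: "cup" in r[4])` without the interface having anticipated that filter.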

Human Interface

A human interface is used to communicate the explanations generated by the robot. Communication of the explanations can occur in different channels, such as a traditional graphical user interface (GUI) on a monitor, head-mounted displays, and robot movements. While some human interface methods have been studied for decades in the HCI community, our literature review is selective. We focus largely on novel approaches (e.g., AR techniques) and the most prominent related work, justified by the citation number relative to the publication year. There is a large body of research for some techniques with existing comprehensive literature reviews. We direct readers to the following papers: eye gaze in social robotics (Admoni and Scassellati 2017), using animation techniques with robots (Schulz, Torresen, and Herstad 2019) (which provides 12 design guidelines), speech and natural language processing for robotics (Mavridis 2015), and tactile communication via artificial skins in social robots (Silvera-Tawil, Rye, and Velonaki 2015).

4.1 Display Screen

While computer interfaces largely use a display screen for the primary communication channel, screens on robots are largely used to display facial expressions (Kalegina et al. 2018) due to their physicality, but are considered less convenient than speech (de Jong et al. 2018). For co-location scenarios, it is rare to find a display screen as part of a robot that is not attached to its head, so very little research has been performed for simple displays or visualizations of sensor values or other relevant information.

Brooks investigated displaying a general set of state icons on the body of robots to indicate internal states (2017). Five icons – OK, Help, Off, Safe, and Dangerous – were shown to participants for evaluation. The results show that while bystanders are able to understand those icons, their level of understanding is vague. For example, the “Off” icon could be interpreted as stating that the robot is powered off or that it is just not currently actively operating.

SoftBank Robotics’ Pepper robot is one of the few robot systems that features a touch screen not attached to its head. Feingold-Polak et al. found that people enjoyed interacting with a touch screen on a robot more than using a computer screen with a keyboard (2018). Specifically, participants preferred to use the touch screen to indicate the completion of a task. de Jong et al. used Pepper’s screen to present buttons for inputting instructions, such as object directions (2018). Bruno et al. used Pepper’s touch screen like a tablet, where multiple-choice questions are shown and users can answer by tapping on the choices (2018).

While a display screen has been demonstrated to be effective at showing accurate information (e.g., replaying past events (Jeong et al. 2017), which can be used during explanations), there is sometimes a mental conversion issue where humans have to map what is displayed on the screen to the physical environment. A display screen may also suffer from being less readable from a longer distance, which is important as such proximity to a robot may not be safe during certain failure cases (Honig and Oron-Gilad 2018).

4.2 Augmented Reality (AR)

Utilizing AR for explainability allows visual cues to be projected directly into the environment with which the robot interacts, allowing for more specificity and reference points to be drawn. This technique can make explanations more accurate and less ambiguous, and can remove the burden of mental mapping between different reference frames (e.g., a 2D display screen compared to the real-world 3D environment).

Andersen et al. proposed to use a projector to communicate a robot’s intent and task information onto the workspace to facilitate human-robot collaboration in a manufacturing environment (2016). The robot locates a physical car door using an edge-based detection method, then projects visualizations of parts onto it to indicate its perception and intended manipulation actions. The authors also conducted an experiment by asking participants to collaboratively rotate and move cubes with the robot arm, comparing the AR projector method to the use of a display screen with text. Results show there were fewer performance errors and questions asked by the participants when using the projector method.

For mobility, researchers have also leveraged projection techniques onto the ground to show robot intention. Chadalavada et al. projected a green line to indicate the planned path and two white lines to the left and right of the robot to visualize the collision avoidance range of the robot (2015). Gradient light bands have also been used to show a robot’s path (Watanabe et al. 2015). Similarly, Coovert et al. projected arrows to show the robot’s path (2014), while Daily et al. used a head-mounted display to visualize the robot’s path onto the user’s view of the environment (2003). Circles have also been used to show landmarks on a robot swarm using a projector located above the performance space (Ghiringhelli et al. 2014). However, the AR techniques used in these papers are passive and not interactive. Chakraborti et al. proposed using Microsoft HoloLens to enable a user to interact with AR projections (2018), where users can use pinch gestures to move a robot’s arm or base, start or stop robot movement, and pick a block for stacking.

AR may be more salient than a display screen for our use case, but it does have some drawbacks. For example, it is poorly suited for a robot to take the initiative in explaining, because a projection can easily be ignored if a human is not paying attention to the projected area.

4.3 Robot Activities

Figure 2: The workflow of the robot explanation system. State summaries are saved in the databases, which are hidden by programming interfaces. A multi-modal human interface queries the database through another programming interface and answers follow-up conversation and interactions. The state summarization component can also initiate explanations.

Due to the physicality of robot systems, body language of robots has been studied extensively in the HRI community to communicate intent. For example, Dragan et al. proposed using legible robot arm movements to allow people to quickly infer the robot’s next grasp target (2015). Repeated arm movement has also been proposed to communicate a robot’s incapability to pick up an object (Kwon, Huang, and Dragan 2018). Eye gaze behavior or head movement has also been studied (e.g., (Moon et al. 2014; Admoni and Scassellati 2017)). However, this communication method is limited in the amount of information that it can convey, if used as the only channel of communication.

In addition to robot movements, researchers have also explored auxiliary methods of communication such as light. Notably, the Rethink Robotics Baxter system utilizes a ring of lights on its head to indicate the distance of humans moving nearby to support safe HRI. Similarly, Szafir, Mutlu, and Fong used light to indicate the flying direction of a drone when co-located with humans in close proximity (2015); the results show improvement in response time and accuracy.

4.4 Proposed Approach

Given that speech is a natural interaction method for people, it should be considered for initial explanations. It can be used to garner a person’s attention and to initiate high-level summary explanations. However, it is limited in that only one audio stream is available at a time (i.e., the human cannot listen to multiple streams simultaneously without interference). Body language can be used to supplement speech explanations, but likely should not be used as the only communication channel. For example, robots can use arm movements to refer to relevant geography of the robot (e.g., components, actions) and the task space (e.g., objects, areas of the environment) simultaneously with another communication channel such as speech.

When a human requests more detailed explanations, other communication channels should be used to avoid misinterpretation. One such channel is the AR projection method described in the literature, given the limitations of display screens (i.e., availability, size, reading distance). When using a projection method, one should keep in mind that a projection on the ground is not always visible (Chakraborti et al. 2018).

Communication of explanations may occur at any of three temporal levels (a priori, in situ, or post hoc), and the level will impact the effectiveness of a chosen communication channel. More detailed, in-depth communications via visual or audio means may be better suited for explanations of planned actions (a priori) or analysis of resulting actions (post hoc). Simpler techniques for alerting people (e.g., flashing lights, vibrating tactors) may be better suited for conveying state information in situ, at least to garner attention before more information is conveyed.

Thus, communication should be multi-modal, as different methods are better suited to different levels of explanation, temporal levels, and data types. The system also needs to ensure that the human understands every means of communication in use.
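One way to read the recommendations in this subsection is as a small set of channel-selection rules. The sketch below is only an illustration under that reading: the `Temporal` enum, the `choose_channels` function, and the channel names are hypothetical, and the rule order is an assumption rather than a validated policy.

```python
from enum import Enum
from typing import List


class Temporal(Enum):
    A_PRIORI = "a priori"
    IN_SITU = "in situ"
    POST_HOC = "post hoc"


def choose_channels(temporal: Temporal, detailed: bool) -> List[str]:
    """Return an ordered list of channels for one explanation (assumed rules)."""
    channels = []
    if temporal is Temporal.IN_SITU:
        # Simple alerting techniques garner attention before more is conveyed.
        channels.append("flashing lights")
    # Speech carries the initial, high-level summary explanation.
    channels.append("speech")
    # Body language supplements speech but is never the only channel.
    channels.append("body language")
    if detailed:
        # Detailed follow-ups move to AR projection to avoid misinterpretation.
        channels.append("AR projection")
    return channels


in_situ_alert = choose_channels(Temporal.IN_SITU, detailed=False)
post_hoc_detail = choose_channels(Temporal.POST_HOC, detailed=True)
```

Keeping the rules in one function makes it straightforward to adjust them after user studies, for example if a projection on the ground turns out not to be visible in a given workspace.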


Next, we describe the integration of the three components of our approach and their interconnections in the robot explanation system. The proposed component designs are intended to serve as guidelines for development with reasonable justifications, rather than compulsory decisions. In addition, the design of each component should be self-contained and decoupled from the operation of other components, allowing each component to be used independently.

5.1 System Review and Workflow

A diagram of the system is shown in Figure 2. There are two main methods for state summarization: manual methods and summarization algorithms. Manually, developers can annotate functions or specify goals as leaves in a tree structure in the robotic applications to provide explanations. When using summarization algorithms, explanations can be learned using end-to-end or semi-supervised deep learning from robot states and annotator-provided explanations. Summarization algorithms should also be able to generalize summaries online when the stored summaries are not sufficient to answer some users’ questions. While generating explanations, the state summarization component can initiate explanations if necessary (e.g., in cases of incapability or failures).
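As an illustration of the manual method, the following Python sketch shows one possible way a developer could annotate a function with a goal. The `explain` decorator, the `EXPLANATION_LOG` list, the `grasp_caddy` function, and the 0.05 m threshold are all hypothetical; the paper does not prescribe this mechanism.

```python
import functools

EXPLANATION_LOG = []  # hypothetical in-memory sink for annotated events


def explain(goal: str):
    """Decorator attaching a developer-written goal to a robot function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = {"function": func.__name__, "goal": goal, "status": "started"}
            EXPLANATION_LOG.append(entry)
            try:
                result = func(*args, **kwargs)
            except Exception as err:
                # Record the failure, then let the caller handle it as usual.
                entry["status"] = "failed: " + str(err)
                raise
            entry["status"] = "succeeded"
            return result
        return wrapper
    return decorator


@explain("Pick up the caddy so it can be carried to the inspection table")
def grasp_caddy(distance_to_wall_m: float) -> bool:
    if distance_to_wall_m < 0.05:  # hypothetical reachability threshold
        raise RuntimeError("caddy is too close to the wall to reach")
    return True


grasp_caddy(0.20)
try:
    grasp_caddy(0.01)
except RuntimeError:
    pass  # the failure is already recorded in EXPLANATION_LOG
```

A recorded failure entry is the kind of event that could prompt the state summarization component to initiate an explanation.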

The generated state summaries and their linked raw data are then saved to databases through a programming interface, currently using C++ or Python. Two databases are used: MongoDB for storing raw sensor data and a relational database to store explanations (i.e., linked summaries). The two data storage methods are mostly hidden by the interface, allowing developers to use the programming language and avoid knowing database details. However, the interface will also provide ways to directly access each database if needed.
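The split between linked summaries and raw data can be sketched as follows. This is a simplified illustration, not the system's schema: Python's built-in `sqlite3` module plays the role of the relational database, a plain dictionary stands in for the MongoDB raw-data store, and the table, column, and document names are hypothetical.

```python
import sqlite3

# Stand-in for the raw-data store (the paper uses MongoDB); keys are document ids.
raw_store = {
    "scan-0042": {"type": "point_cloud", "points": 30712},
}

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE summary (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES summary(id),  -- link to a higher-level summary
        text      TEXT NOT NULL,
        raw_ref   TEXT                             -- id of a raw-data document, if any
    )
    """
)
db.executemany(
    "INSERT INTO summary (id, parent_id, text, raw_ref) VALUES (?, ?, ?, ?)",
    [
        (1, None, "Grasped the wrong gearbox part.", None),
        (2, 1, "Two parts appeared similar in height in the point cloud.", "scan-0042"),
    ],
)


def details_for(summary_id: int):
    """Return (text, raw document) pairs for summaries linked under `summary_id`."""
    rows = db.execute(
        "SELECT text, raw_ref FROM summary WHERE parent_id = ? ORDER BY id",
        (summary_id,),
    ).fetchall()
    return [(text, raw_store.get(ref)) for text, ref in rows]


linked = details_for(1)
```

Here the relational table holds only the explanation text and its links, while the bulky sensor data stays in the document store and is fetched by reference when a follow-up question needs it.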

With the stored summaries in the database, after the robot or human initiates communication, the human interface is responsible for all follow-up conversations or interactions by passing the semantics to the state summarization component. Communication should occur in multiple modalities including speech, body language, screen, and AR (e.g., projection techniques). Due to the differences of each communication method in terms of fidelity, attention-getting, etc., the system will utilize multi-modal communication.

5.2 Evaluation

To evaluate the effectiveness of the system, usability testing should be performed with users of varying experience to assess the acceptance and understanding of the system.

Figure 3: The FetchIt! task environment we recommend for evaluation. The robot’s goal is to place the irregular parts into the correct sections of the concave caddy, and transport the caddy to the bottom-left table for inspection.

For an example scenario and tasks, we recommend using the FetchIt! mobile manipulation challenge (FetchIt!); Figure 3 shows a rendering of the task space. The tasks are to assemble a kit by navigating to collect parts at different stations. While scoped for a manufacturing environment, the same types of tasks are relevant to home environments: e.g., navigating between areas in a narrow hallway kitchen and a dining table, and manipulating objects in these places. The challenges are also not unique to a work cell manufacturing environment. One challenge is detecting different objects, whose shapes are complex and irregular (e.g., large bolts and gearbox parts; kitchen utensils would be similarly difficult). Although not visible in the rendered task space, the screws are in the blue container. Another challenge is to navigate through the narrow and constrained work cell; the available space and obstructions therein are similar to those of a kitchen.

The FetchIt! environment provides a reasonable test bed for a robot explanation system because there are several opportunities for unexpected or opaque events to occur. For example, a common occurrence is that the Fetch robot may not be able to grasp a caddy or a gearbox part if it is placed too close to a wall. Fetch’s arm may not be long enough to reach given the constraints presented by the end-effector orientation (it must be pointed down in order to grasp the caddy) and standoff distances imposed by the dimensions of the tables. These scenarios are not apparent to novice users or bystanders who do not have intimate knowledge of Fetch’s characteristics. In this scenario, Fetch should initiate an explanation to inform the user. Another common occurrence is confusion when differentiating between two gearbox parts that appear similar in height via point cloud due to sensor noise. This can make the object detection fail, causing the robot to grasp the incorrect object. In this scenario, a human might initiate a robot explanation, as the robot may not be aware that it performed incorrectly. Another human-initiated robot explanation might be when the robot stops at a different location in front of the caddy table than what was expected, and places a part into the incorrect caddy. This could occur due to navigation error range and the narrow horizontal field of view (54°) of its RGBD camera, which may cause part of the caddy to be occluded.
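The field-of-view issue in the last scenario can be made concrete with a short calculation. The sketch below assumes an idealized pinhole camera looking straight at the table edge and ignores lens distortion and camera mounting; only the 54° horizontal field of view comes from the text, while the caddy width, distance, and navigation offset are hypothetical values.

```python
import math

H_FOV_DEG = 54.0  # horizontal field of view stated in the text


def visible_half_width(distance_m: float, fov_deg: float = H_FOV_DEG) -> float:
    """Half of the horizontal extent seen at `distance_m` (pinhole model)."""
    return distance_m * math.tan(math.radians(fov_deg / 2.0))


def fully_visible(width_m: float, lateral_offset_m: float, distance_m: float) -> bool:
    """True if an object of `width_m`, centered `lateral_offset_m` off the
    optical axis, lies entirely inside the horizontal field of view."""
    return abs(lateral_offset_m) + width_m / 2.0 <= visible_half_width(distance_m)


CADDY_WIDTH_M = 0.25  # hypothetical caddy width
DISTANCE_M = 0.50     # hypothetical camera-to-caddy distance

centered = fully_visible(CADDY_WIDTH_M, 0.00, DISTANCE_M)
after_nav_error = fully_visible(CADDY_WIDTH_M, 0.15, DISTANCE_M)
```

Under these assumed numbers the camera sees roughly 0.25 m to either side of its axis at 0.5 m, so a centered caddy fits in view, but a 0.15 m lateral navigation error pushes one edge of the caddy out of the image.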

The FetchIt! competition testbed is available for Gazebo on GitHub: A working implementation, including navigation and manipulation, is available at

Future Work

To date, we have started implementing the robot explanation system. The system will be open-sourced to facilitate advancing research and to assist other practitioners in integrating their existing software with the system. We plan to evaluate the implementation with a formal HRI user study and analyze the results for further improvements.


This paper presents a survey of the three components of the robot explanation system. For state summarization, manually generated summaries may be the workaround solution, because learning methods are not yet mature and require more research effort. For storage and querying, the programming interface should be developed for easy integration by state summarization and human interface developers. Multi-modal human interface communication methods should be used not only to garner attention from humans when initiating an explanation, but also as the enabling methods to convey both high-level summarized explanations and low-level detailed explanations.


This work has been supported in part by the Office of Naval Research (N00014-18-1-2503).


    Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B. Y.; and Kankanhalli, M. 2018. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 582:1–582:18.

    Adadi, A., and Berrada, M. 2018. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 6:52138–52160.

    Admoni, H., and Scassellati, B. 2017. Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction 6(1):25–63.

    Admoni, H.; Weng, T.; Hayes, B.; and Scassellati, B. 2016. Robot nonverbal behavior improves task performance in difficult collaborations. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction, 51–58.

    Andersen, R. S.; Madsen, O.; Moeslund, T. B.; and Amor, H. B. 2016. Projecting robot intentions into human environments. In 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 294–301.

    Anjomshoae, S.; Najjar, A.; Calvaresi, D.; and Främling, K. 2019. Explainable agents and robots: Results from a systematic literature review. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 1078–1088. International Foundation for Autonomous Agents and Multiagent Systems.

    Bainbridge, W. A.; Hart, J.; Kim, E. S.; and Scassellati, B. 2008. The effect of presence on human-robot interaction. In The 17th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 701–706.

    Balint-Benczédi, F.; Márton, Z.; Durner, M.; and Beetz, M. 2017. Storing and retrieving perceptual episodic memories for long-term manipulation tasks. In 2017 18th International Conference on Advanced Robotics (ICAR), 25–31.

    Beetz, M.; Mösenlechner, L.; and Tenorth, M. 2010. CRAM — a cognitive robot abstract machine for everyday manipulation in human environments. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, 1012–1017. IEEE.

    Beetz, M.; Tenorth, M.; and Winkler, J. 2015. Open-EASE — a knowledge processing service for robots and robotics/AI researchers. In 2015 IEEE International Conference on Robotics and Automation (ICRA), 1983–1990.

    Brooks, D.; Shultz, A.; Desai, M.; Kovac, P.; and Yanco, H. A. 2010. Towards state summarization for autonomous robots. In 2010 AAAI Fall Symposium Series.

    Brooks, D. J. 2017. A Human-Centric Approach to Autonomous Robot Failures. Ph.D. Dissertation, Department of Computer Science, University.

    Bruno, B.; Menicatti, R.; Recchiuto, C. T.; Lagrue, E.; Pandey, A. K.; and Sgorbissa, A. 2018. Culturally-competent human-robot verbal interaction. In 2018 15th International Conference on Ubiquitous Robots (UR), 388–395.

    Chadalavada, R. T.; Andreasson, H.; Krug, R.; and Lilienthal, A. J. 2015. That’s on my mind! robot to human intention communication through on-board projection on shared floor space. In 2015 European Conference on Mobile Robots (ECMR), 1–6.

    Chakraborti, T.; Sreedharan, S.; Kulkarni, A.; and Kambhampati, S. 2018. Projection-aware task planning and execution for human-in-the-loop operation of robots in a mixed-reality workspace. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4476–4482. IEEE.

    Cleland-Huang, J.; Gotel, O.; Zisman, A.; et al. 2012. Software and systems traceability, volume 2. Springer.

    Coovert, M. D.; Lee, T.; Shindev, I.; and Sun, Y. 2014. Spatial augmented reality as a method for a mobile robot to communicate intended movement. Computers in Human Behavior 34:241–248.

    Daily, M.; Cho, Y.; Martin, K.; and Payton, D. 2003. World embedded interfaces for human-robot interaction. In 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the, 6–pp. IEEE.

    De Graaf, M. M., and Malle, B. F. 2017. How people explain action (and autonomous intelligent systems should too). In 2017 AAAI Fall Symposium Series.

    de Jong, M.; Zhang, K.; Roth, A. M.; Rhodes, T.; Schmucker, R.; Zhou, C.; Ferreira, S.; Cartucho, J.; and Veloso, M. 2018. Towards a robust interactive and learning social robot. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 883–891.

    Desai, M.; Kaniarasu, P.; Medvedev, M.; Steinfeld, A.; and Yanco, H. 2013. Impact of robot failures and feedback on real-time trust. In Proceedings of the 8th ACM/IEEE International Conference on Human-robot Interaction, 251–258.

    Dietrich, A.; Zug, S.; Mohammad, S.; and Kaiser, J. 2014. Distributed management and representation of data and context in robotic applications. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 1133–1140. IEEE.

    Dietrich, A.; Zug, S.; and Kaiser, J. 2015. Selectscript: A query language for robotic world models and simulations. In 2015 IEEE International Conference on Robotics and Automation (ICRA), 6254–6260.

    Dragan, A. D.; Bauman, S.; Forlizzi, J.; and Srinivasa, S. S. 2015. Effects of robot motion on human-robot collaboration. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, 51–58. ACM.

    Dragan, A. D.; Lee, K. C.; and Srinivasa, S. S. 2013. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE international conference on Human-robot interaction, 301–308.

    Durner, M.; Kriegel, S.; Riedel, S.; Brucker, M.; Márton, Z.-C.; Bálint-Benczédi, F.; and Triebel, R. 2017. Experience-based optimization of robotic perception. In 2017 18th International Conference on Advanced Robotics (ICAR), 32–39.

    Ehsan, U.; Tambwekar, P.; Chan, L.; Harrison, B.; and Riedl, M. O. 2019. Automated rationale generation: a technique for explainable AI and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI), 263–274.

    Feingold-Polak, R.; Elishay, A.; Shahar, Y.; Stein, M.; Edan, Y.; and Levy-Tzedek, S. 2018. Differences between young and old users when interacting with a humanoid robot: a qualitative usability study. Paladyn, Journal of Behavioral Robotics 9(1):183–192.

    Fetchit!, a mobile manipulation challenge. Accessed: 2019-09-11.

    Fourie, D.; Claassens, S.; Pillai, S.; Mata, R.; and Leonard, J. 2017. Slamindb: Centralized graph databases for mobile robotics. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 6331–6337.

    Ghiringhelli, F.; Guzzi, J.; Di Caro, G. A.; Caglioti, V.; Gambardella, L. M.; and Giusti, A. 2014. Interactive augmented reality for understanding and analyzing multi-robot systems. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 1195–1201.

    Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; and Pedreschi, D. 2018. A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51(5):93.

    Haidarian, H.; Dinalankara, W.; Fults, S.; Wilson, S.; Perlis, D.; Schmill, M.; Oates, T.; Josyula, D.; and Anderson, M. 2010. The metacognitive loop: An architecture for building robust intelligent systems. In AAAI Fall Symposium on Commonsense Knowledge (AAAI/CSK’10).

    Hayes, B., and Shah, J. A. 2017. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, 303–312.

    Honig, S., and Oron-Gilad, T. 2018. Understanding and resolving failures in human-robot interaction: Literature review and model development. Frontiers in psychology 9:861.

    Hornung, A.; Wurm, K. M.; Bennewitz, M.; Stachniss, C.; and Burgard, W. 2013. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots. Software available at

    Jeong, S.-Y.; Choi, I.-J.; Kim, Y.-J.; Shin, Y.-M.; Han, J.-H.; Jung, G.-H.; and Kim, K.-G. 2017. A study on ROS vulnerabilities and countermeasure. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, 147–148.

    Kalegina, A.; Schroeder, G.; Allchin, A.; Berlin, K.; and Cakmak, M. 2018. Characterizing the design space of rendered robot faces. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, 96–104.

    Kaptein, F.; Broekens, J.; Hindriks, K.; and Neerincx, M. 2017. Personalised self-explanation by robots: The role of goals versus beliefs in robot-action explanation for children and adults. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 676–682. IEEE.

    Kwon, M.; Huang, S. H.; and Dragan, A. D. 2018. Expressing robot incapability. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, 87–95.

    Lasota, P. A.; Fong, T.; Shah, J. A.; et al. 2017. A survey of methods for safe human-robot interaction. Foundations and Trends® in Robotics 5(4):261–349.

    Malle, B. F. 2006. How the mind explains behavior: Folk explanations, meaning, and social interaction. MIT Press.

    Mavridis, N. 2015. A review of verbal and non-verbal human–robot interactive communication. Robotics and Autonomous Systems 63:22–35.

    Miller, T. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267:1–38.

    Moon, A.; Troniak, D. M.; Gleeson, B.; Pan, M. K.; Zheng, M.; Blumer, B. A.; MacLean, K.; and Croft, E. A. 2014. Meet me where I’m gazing: how shared attention gaze affects human-robot handover timing. In Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, 334–341.

    Niemueller, T.; Abdo, N.; Hertle, A.; Lakemeyer, G.; Burgard, W.; and Nebel, B. 2013. Towards deliberative active perception using persistent memory. In Proc. IROS 2013 Workshop on AI-based Robotics.

    Niemueller, T.; Lakemeyer, G.; and Srinivasa, S. S. 2012. A generic robot database and its application in fault analysis and performance evaluation. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 364–369.

    Oliveira, M.; Lim, G. H.; Lopes, L. S.; Kasaei, S. H.; Tomé, A. M.; and Chauhan, A. 2014. A perceptual memory system for grounding semantic representations in intelligent service robots. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2216–2223.

    Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Wheeler, R.; and Ng, A. Y. 2009. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, 5.

    Ravichandran, R.; Prassler, E.; Huebel, N.; and Blumenthal, S. 2018. A workbench for quantitative comparison of databases in multi-robot applications. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3744–3750.

    Schraagen, J. M.; Chipman, S. F.; and Shalin, V. L. 2000. Cognitive task analysis. Psychology Press.

    Schulz, T.; Torresen, J.; and Herstad, J. 2019. Animation techniques in human-robot interaction user studies: A systematic literature review. ACM Transactions on Human-Robot Interaction (THRI) 8(2):12.

    Seo, S. H.; Geiskkovitch, D.; Nakane, M.; King, C.; and Young, J. E. 2015. Poor thing! would you feel sorry for a simulated robot?: A comparison of empathy toward a physical and a simulated robot. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, 125–132.

    Silvera-Tawil, D.; Rye, D.; and Velonaki, M. 2015. Artificial skin and tactile sensing for socially interactive robots: A review. Robotics and Autonomous Systems 63:230–243.

    Szafir, D.; Mutlu, B.; and Fong, T. 2015. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 19–26.

    Tenorth, M.; Winkler, J.; Beßler, D.; and Beetz, M. 2015. Open-EASE: A cloud-based knowledge service for autonomous learning. KI-Künstliche Intelligenz 29(4):407–411.

    W3C. 2009. OWL 2 web ontology language document overview. W3C recommendation, W3C.

    Wang, L.; Liu, M.; Meng, M. Q.-H.; and Siegwart, R. 2012. Towards real-time multi-sensor information retrieval in cloud robotic system. In 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), 21–26.

    Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; and Tang, X. 2017. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156–3164.

    Wang, N.; Pynadath, D. V.; and Hill, S. G. 2016. The impact of POMDP-generated explanations on trust and performance in human-robot teams. In Proceedings of the 2016 international conference on autonomous agents & multiagent systems, 997–1005. International Foundation for Autonomous Agents and Multiagent Systems.

    Watanabe, A.; Ikeda, T.; Morales, Y.; Shinozawa, K.; Miyashita, T.; and Hagita, N. 2015. Communicating robotic navigational intentions. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5763–5769.

    Winkler, J.; Tenorth, M.; Bozcuoglu, A. K.; and Beetz, M. 2014. Cramm–memories for robots performing everyday manipulation activities. Advances in Cognitive Systems 3:47–66.

    Zhang, Q.-s., and Zhu, S.-C. 2018. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1):27–39.