WYSD 2022 — HRI 2022 Workshop YOUR study design! Participatory critique and refinement of participants’ studies

“Why Didn’t I Do It?” A Study Design to Evaluate Robot Explanations

Gregory LeMasurier, Alvika Gautam, Zhao Han, Jacob W. Crandall, Holly A. Yanco

The “detect screw” subtree with Assumption Checker nodes (prefixed C, green names) and Action nodes (prefixed A, white names). The root RetryUntilSuccessful node retries the subtree up to three times if a failure or assumption violation occurs. The ReactiveSequenceNode asynchronously runs all actions including “lift torso”, “look at table”, and “detect screw” while continuously checking the assumption, “check updated head3DCamera”. An example of pre and post-conditions can be seen before and after the “detect screw” action.
  • Feb 15, 2022

    Our experiment design paper led by Gregory LeMasurier was accepted to the HRI 2022 workshop "Workshop YOUR study design! Participatory critique and refinement of participants’ studies"! I am looking forward to the feedback!


As robot systems become ubiquitous in more complex tasks, there is a pressing need for robots to be able to explain their behaviors in order to gain trust and acceptance. In this paper, we discuss our plan for an online human subjects study to evaluate our new system for explanation generation. Specifically, the system monitors violations of the expected environment behavior and hardware state. These monitors, referred to as Assumption Checkers, are incorporated in a Behavior Tree that represents the robot’s behavior and state during a task. This integrated structure enables a robot to detect failures and explain them to humans along with their underlying causes. Through this work, we hope to improve the ability of robots to communicate anomalies in human-robot shared workspaces.

Index Terms—Behavior explanation, behavior trees, assumption checkers, robot explanation generation, robot transparency

I. Introduction

As technological advances enable robot systems to be more autonomous and to complete more complex tasks, these systems need to be capable of explaining their behaviors. In human-robot shared workspaces, such as warehouses or manufacturing floors, it is especially important for a robot to collaborate with humans on completing tasks efficiently while also providing explanations to surrounding workers, e.g., when the robot encounters a system failure or makes an unexpected observation in the environment, so that human coworkers can assist it.

For these explanations, it is important to have an appropriate level of detail (i.e., sufficient but not overwhelming) for non-expert human coworkers to be able to resolve anomalies which the robot may not correct on its own [1]. Behavior Trees can be leveraged by robot systems to automatically generate explanations. A Behavior Tree (BT) [2] is a task sequence method that can be used to represent a robot’s internal states and actions for the execution of robot tasks. In our previous work on using BTs to facilitate robot explanations [3], we framed BTs into semantic sets: {goal, subgoals, steps, actions}. These semantic sets are used to create shallow hierarchical explanations and to answer follow-up questions when users ask for details, the preferred method identified by Han et al. [1].

While our prior work was capable of explaining what action the robot failed to accomplish, the system did not have enough information to explain why the action failed. Understanding why a robot failed is necessary so that human coworkers can provide assistance. For example, consider a scenario where a human coworker is tasked with supplying the robot with screws and refilling the screw container for an assembly. If the robot fails to pick up a screw, it is easier for the human to assist the robot if they know why the robot failed (e.g., the screw was too far from its arm or the robot could not see the screw). Based on the explanation, the human could move the screw container to a location the robot can reach or make sure that the robot’s view is not blocked. Additionally, as the human becomes aware of these potential anomalies, teamwork with the robot can improve in the future.

For a robot to explain why it failed to complete a task, it must be able to identify the cause of failure. Robotic systems often make assumptions about their system capabilities and expected environment state. These sets of assumptions and biases, either intentionally or unintentionally, are found in decision-making algorithms and they dictate the generator’s performance [4]. Considering the screw example again, the robot’s motion planner that outputs a path to reach and pick up the screw assumes that the location of the screw determined by the vision system is accurate. When these assumptions and biases are correct, the autonomous robot’s behavior and its impact on the world are predictable, leading to the perception of higher performance. However, when the assumptions are not met, the robot’s behavior and its impact on the world are less predictable, leading to failures and degraded performance. Thus, communicating any assumption violations about the algorithm inputs (i.e., expected system and environment state) and their expected outputs is an important aspect of explaining a robot’s behavior, as it gives insight into the robot’s awareness of its limits in completing a task successfully. This is supported by the findings of Das et al. [5], who found that explanations consisting of context and action history effectively enable non-experts to identify and provide solutions to errors encountered by a robot system. Additionally, these assumptions can be used to estimate system performance [6] and take corrective actions [7]. In this work, we identify the various task-related assumptions and encode them into the behavior tree representation. We leverage the status of these assumptions to track failures and generate explanations.

In order to track the status of the above-mentioned assumptions, we implemented monitors and encoded them as a part of the behavior tree. We refer to these monitors as Assumption Checkers (ACs), which are used to continuously track whether assumptions hold during task execution. We implemented Assumption Checker nodes in the behavior tree that represents a mobile manipulation kitting task  [8], so the robot can identify any violations in the expected system and environment state, and can communicate the cause and timing of an action failure by actively tracking these violations. Figure 1 shows the detect screw subtree of the behavior tree with Assumption Checkers.

Fig. 1. The “detect screw” subtree with Assumption Checker nodes (green labels) and Action nodes (white labels). The root RetryUntilSuccessful node (Node 1) retries the subtree up to three times if a failure or assumption violation occurs. The ReactiveSequenceNode (Node 2) asynchronously runs all actions including “lift torso” (Node 4), “look at table” (Node 5), and “detect screw” (Node 8) while continuously checking the assumption, “check updated head3DCamera” (Node 3). Node 7 checks a precondition assumption, while Nodes 6, 9, 10, and 11 check postcondition assumptions for their corresponding action nodes.
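The structure in Fig. 1 can be illustrated with a minimal sketch. This is not the implementation used in our system: the class layout, the explanation wording, the world-state keys, and the precondition and postcondition names are assumptions made for illustration. It is also synchronous, so the guard assumption is re-checked before each child rather than continuously while an action is running.

```python
class AssumptionChecker:
    """Leaf node: succeeds while its assumption holds, else records the violation."""

    def __init__(self, name, holds):
        self.name, self.holds = name, holds

    def tick(self, world):
        if self.holds(world):
            return True
        world["violated"] = self.name
        return False


class Action:
    """Leaf node: runs a (here simulated) robot action and records a failure."""

    def __init__(self, name, run):
        self.name, self.run = name, run

    def tick(self, world):
        if self.run(world):
            return True
        world["failed_action"] = self.name
        return False


class ReactiveSequence:
    """Runs children in order, re-checking every guard before each child."""

    def __init__(self, guards, children):
        self.guards, self.children = guards, children

    def tick(self, world):
        for child in self.children:
            if not all(node.tick(world) for node in (*self.guards, child)):
                return False
        return True


def explain(world):
    """Hypothetical wording: name the violated assumption if one was recorded."""
    if "violated" in world:
        return f"I need to retry because the assumption '{world['violated']}' was violated."
    return f"I failed to complete the action '{world['failed_action']}'."


class RetryUntilSuccessful:
    """Retries the child up to max_attempts times, explaining each failure."""

    def __init__(self, child, max_attempts=3):
        self.child, self.max_attempts = child, max_attempts

    def tick(self, world):
        for attempt in range(1, self.max_attempts + 1):
            world["attempt"] = attempt
            world.pop("violated", None)
            world.pop("failed_action", None)
            if self.child.tick(world):
                return True
            world["explanations"].append(explain(world))
        return False


camera_guard = AssumptionChecker(
    "check updated head3DCamera",
    lambda w: w["attempt"] not in w["camera_down_on_attempts"],
)
detect_screw_subtree = RetryUntilSuccessful(
    ReactiveSequence(
        guards=[camera_guard],
        children=[
            Action("lift torso", lambda w: True),
            Action("look at table", lambda w: True),
            # Hypothetical precondition and postcondition for "detect screw".
            AssumptionChecker("screw bin is on the table", lambda w: w["bin_on_table"]),
            Action("detect screw", lambda w: w["bin_on_table"]),
            AssumptionChecker("screw pose was found", lambda w: w["bin_on_table"]),
        ],
    )
)

# Simulated temporary camera failure during the first attempt only.
world = {"camera_down_on_attempts": {1}, "bin_on_table": True, "explanations": []}
succeeded = detect_screw_subtree.tick(world)
print(succeeded, world["explanations"])
```

Without the checker nodes, only the branch of `explain` that names the failed action would ever be reached, which mirrors the difference between the BT and BT+AC conditions.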

With this work, we hope to evaluate explanations generated by our combined system, where explanations consist of the failed action as well as a cause of failure, compared to the original behavior tree system, which only had the ability to indicate an action failure. In this paper, we specifically describe our hypotheses and an experiment design to evaluate our improved system.

II. Task Description

To evaluate our proposed system, we plan to use a Fetch robot [9] – a mobile manipulator with a chest-mounted seven-degree-of-freedom arm and a head-mounted RGBD camera – in the arena of and with the tasks from the FetchIt! Mobile Manipulation Challenge [10]. In this challenge, a Fetch robot was tasked with preparing as many kits of gearbox parts as possible. To achieve the goal, the robot needs to navigate to different tables to pick different parts, such as screws, gearbox tops/bottoms, and large/small gears, and then place them into a caddy on another table. The mechanical parts are either lying on the table or in a container on a table.

In this work, we use the subtask of adding a screw to the kit of parts. The task entities include the robot, the object to be picked from a location, and the goal location where the object needs to be placed. All other objects shown in the figures are considered obstacles by the robot. For the task, we utilize two tables, as seen in Figure 2: the screw station and the caddy station. The robot starts in the middle of the arena and then navigates to the screw station, which contains a bin of screws for the robot to pick from. Next, the robot navigates to the caddy station and places the picked-up screw into a smaller compartment in the caddy.

Fig. 2.  The FetchIt! challenge task arena, where the robot is initially positioned in the middle. The robot then navigates to the screw station to pick a screw from the green bin, and finally places the screw in the blue caddy on the caddy station.

To evaluate the ability of the robot to finish the task and convey any deviations from the expected behavior, an experimenter will act as a challenger to simulate potential real-world failures. The environment and the robot’s hardware will be manipulated, for example by removing the screw bin while the robot is attempting to pick a screw or by introducing a camera failure.

III. Hypotheses

As seen in Zudaire et al. [7], whose work focused on UAVs, tracking assumptions helped the system to maintain safe, goal-oriented behavior and to take corrective actions in case of any deviations. Thus, by adding assumption checkers to a behavior tree, a robot can identify and respond to errors earlier than if no assumption checkers were included. Additionally, with this extra information, we hypothesize that robots can communicate what went wrong with higher-quality explanations, which will in turn result in a better human understanding of the robot system and its performance.

Explanation evaluations will consist of a user study where participants will be shown video segments of the robot performing the task. The video segments will be followed by a set of questions. Next, we describe some hypotheses about the explanations and corresponding metrics on which we plan to evaluate these explanations.

A. Human Perception of System

1) H1: Perceived Intelligence: The BT+AC condition will result in higher perceived robot intelligence compared to the BT condition. We hypothesize this as BT simply states the failure, while the BT+AC explanation shows that the robot system not only has information that it failed, but also has information as to why it failed. We hypothesize that this additional information will result in people perceiving BT+AC as having a higher level of intelligence.

2) H2: Trustworthiness: BT+AC will result in a more trustworthy system. We hypothesize this as the system is capable of providing more details regarding its failure. Expressing failures has been found to increase the trustworthiness of robot systems [11]. Prior work has also found that providing more detailed explanations increases the perceived trustworthiness of robot systems [12], [13], [14].

B. Explanation Quality

BT+AC will result in higher explanation quality compared to BT alone, where explanation quality is defined by understandability, informativeness, communication time, and temporal quality.

1) H3: Understandability: BT+AC will result in a better user understanding of robot functions compared to the BT condition. This is supported by Malle [15], who stated that with causal knowledge and an improved understanding “people can simulate counterfactual as well as future events under a variety of possible circumstances”. Additionally, Das et al. found that explanations consisting of the context of failure and action history enabled non-experts to identify and provide solutions to errors encountered by a robot system [5].

2) H4: Informativeness: BT+AC will be more informative than BT. Informativeness is defined as “a measurement of how much informational content communication possesses” [16]. We hypothesize this because the BT+AC condition communicates that there is an error and why the error occurred, whereas in the base condition the robot only has enough information to state that it failed.

3) H5: Communication Time: BT+AC will result in lower communication time compared to BT. Communication time is defined as “the amount of time required for a message to be (1) generated, (2) transmitted, and (3) understood by the recipient” [17]. In the base condition, the robot can only explain that it failed, so the participant will have to spend more time inferring from the environment why the robot failed. We therefore expect the BT+AC condition to have a lower communication time than the BT condition.

4) H6: Temporal Quality: BT+AC will have better timing compared to BT. We believe this will hold because, in the BT condition, the robot does not have any indication that it failed the task until an action node fails; therefore, it cannot communicate the failure as soon as an assumption is violated. Thus, the BT system will explain its failure after executing all actions that it can before failing, whereas in the BT+AC condition, the robot will communicate the anomaly and take the appropriate next action as soon as it notices an assumption violation.

IV. Methods

A. Conditions

We have designed a within-subjects online user study, using Prolific and Amazon Mechanical Turk, where participants will observe a total of two scenarios, one for each of the base (BT) and combined (BT+AC) systems. The order of these conditions will be counterbalanced to reduce any learning effects.
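One way to realize this counterbalancing is sketched below. The function name, the participant identifiers, and the use of 54 participants are assumptions for illustration. The sketch only balances the order of the two conditions and leaves open how scenarios are paired with conditions.

```python
from collections import Counter
from itertools import cycle, permutations

CONDITIONS = ("BT", "BT+AC")


def assign_orders(participant_ids, conditions=CONDITIONS):
    """Cycle through all condition orders so each order is used equally often."""
    orders = cycle(permutations(conditions))
    return {pid: next(orders) for pid in participant_ids}


# Hypothetical participant IDs 1..54.
assignment = assign_orders(range(1, 55))
order_counts = Counter(assignment.values())
print(order_counts)
```

With an even number of participants, each of the two orders is assigned to exactly half of them. In practice, the participant list would typically be shuffled first so that order is not tied to sign-up time.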

B. Procedure

We plan to recruit N = 54 participants and analyze the data using paired-samples t-tests. The sample size was determined by an a priori power analysis using G*Power [18] with the test “Means: Difference between two dependent means (matched pairs)”. The parameters used in this analysis were: Tails = 2, a medium effect size dz = 0.5, α error probability = 0.05, and Power (1 – β error probability) = 0.95.
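As a rough cross-check of this sample size, the sketch below uses a normal approximation with a small-sample correction term. It is not the exact noncentral-t computation that G*Power performs, and the function name is hypothetical. For dz = 0.5, α = 0.05 (two tails), and power = 0.95, it also arrives at 54 participants.

```python
from math import ceil
from statistics import NormalDist


def paired_t_sample_size(dz, alpha=0.05, power=0.95, tails=2):
    """Approximate N for a paired-samples t-test (normal approximation)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / tails)  # critical value for the chosen alpha
    z_beta = z(power)               # quantile matching the desired power
    n_normal = ((z_alpha + z_beta) / dz) ** 2
    # Correction term that accounts for using t instead of z critical values.
    return ceil(n_normal + z_alpha ** 2 / 2)


n_required = paired_t_sample_size(dz=0.5)
print(n_required)
```

Because the result scales with 1 / dz², halving the assumed effect size roughly quadruples the required number of participants.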

In this study, participants will first fill out an agreement-to-participate form and answer a set of demographic questions. Each participant will then go through two scenarios; for each, a scenario description will provide context, followed by the corresponding videos. At the end of every video segment, a set of questions will be asked to assess the participant’s experience with the robot. First, the participant will be asked a simple attention check question, such as what color the robot was, to make sure that the participant is legitimate and paying attention. Then the participant will answer a set of end-of-scenario questions as seen in Section IV-D. Finally, after both scenarios, participants will be asked to indicate which system they preferred.

C. Scenario

Both scenarios are set in a manufacturing company that produces kits for assembling gearboxes and uses a combination of people and robots in its production line. The workers bring parts that are ready to be added to the kit to the tables in the shared workspace. Participants will then see an image of the robot helping to assemble a kit of parts by adding screws into the caddy. In the scenarios, the robot can drive around, pick up a screw, and place it into a caddy. The participant is then informed that one day, they notice that the robot has not been assembling kits as quickly as expected. The participant is then tasked with observing the robot to figure out why. Next, the participant watches a video where the robot attempts to pick up a screw. In the video, the robot will experience one of the two anomaly conditions, which will cause it to fail its task. In the first scenario, the participants will see a worker remove the screw bin from the table to add more screws while the robot is trying to grasp a screw. In the second scenario, the participants will observe a temporary camera failure. As soon as the system identifies that it has failed a particular subgoal and must retry, it will explain the failure through speech and then retry the subgoal.

D. Measures

1) Perceived Intelligence: Participants will rate how they perceive the robot’s intelligence on a 7-point Likert-type scale (H1).

2) Trustworthiness: Participants will be asked to rate the trustworthiness of the robot on a 7-point Likert-type scale (H2).

3) Understandability: Participants will be asked to rate how close the robot’s behavior was to the expected behavior for the task. Then, they will be asked to classify their observation of the robot’s behavior as either “The robot’s behavior was as expected”, “The robot’s behavior was unexpected but it was able to recover”, or “The robot’s behavior was unexpected and it was unable to complete the task”. If they answer that the robot’s behavior was unexpected, they will be asked when they noticed the unexpected behavior as well as what happened that was unexpected. Additionally, participants will be asked to answer an open-response question asking what caused the robot to need to retry or to fail its task (H3).

4) Informativeness: Participants will be asked if the amount of information included in the explanation was sufficient. Answers will be encoded on a 7-point Likert-type scale (H4). Additionally, participants will be asked what other information they wish the robot had included in its explanation.

5) Communication Time: Participants will be asked to indicate when they were able to identify the reason for the retry or failure (H5).

6) Temporal Quality: Participants will respond to a 7-point Likert-type question asking them to rate their agreement that the robot explained its actions within a reasonable time (H6).

V. Conclusion

Through this proposed user study, we hope to evaluate our new system for explanation generation, which uses Assumption Checkers in a Behavior Tree to automatically generate more detailed explanations regarding robot failures. We will analyze the time to decision, explanation quality, and human perception of the new combined system. With this work we hope to improve the ability of robots to communicate anomalies in human-robot shared workspaces by providing explanations with an appropriate amount of detail.


This work has been supported in part by the Office of Naval Research (N00014-18-1-2503).


[1] Z. Han, E. Phillips, and H. A. Yanco, “The need for verbal robot explanations and how people would like a robot to explain itself,” ACM Transactions on Human-Robot Interaction (THRI), vol. 10, no. 4, pp. 1–42, 2021.

[2] M. Colledanchise and P. Ögren, Behavior trees in robotics and AI: An introduction. CRC Press, 2018.

[3] Z. Han, D. Giger, J. Allspaw, M. S. Lee, H. Admoni, and H. A. Yanco, “Building the foundation of robot explanation generation using behavior trees,” ACM Transactions on Human-Robot Interaction (THRI), vol. 10, no. 3, pp. 1–31, 2021.

[4] L. S. Fletcher, S. Teller, E. B. Olson, D. C. Moore, Y. Kuwata, J. P. How, J. J. Leonard, I. Miller, M. Campbell, D. Huttenlocher, A. Nathan, and F.-R. Kline, The MIT – Cornell Collision and Why It Happened. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 509–548.

[5] D. Das, S. Banerjee, and S. Chernova, “Explainable AI for robot failures: Generating explanations that improve user assistance in fault recovery,” in Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction, 2021, pp. 351–360.

[6] A. Gautam, T. Whiting, X. Cao, M. A. Goodrich, and J. W. Crandall, “A method for designing autonomous agents that know their limits,” in IEEE International Conference on Robotics and Automation (ICRA), 2022 (to appear).

[7] S. Zudaire, F. Gorostiaga, C. Sánchez, G. Schneider, and S. Uchitel, “Assumption monitoring using runtime verification for UAV temporal task plan executions,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 6824–6830.

[8] Z. Han, J. Allspaw, G. LeMasurier, J. Parrillo, D. Giger, S. R. Ahmadzadeh, and H. A. Yanco, “Towards mobile multi-task manipulation in a confined and integrated environment with irregular objects,” in Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 11 025–11 031.

[9] M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich, “Fetch & Freight: Standard platforms for service robot applications,” in Workshop on Autonomous Mobile Service Robots, 2016.

[10] “FetchIt! A mobile manipulation challenge,” https://opensource.fetchrobotics.com/competition, accessed: 2022-02-04.

[11] M. Kwon, S. H. Huang, and A. D. Dragan, “Expressing robot incapability,” in Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, 2018, pp. 87–95.

[12] N. Wang, D. V. Pynadath, and S. G. Hill, “Trust calibration within a human-robot team: Comparing automatically generated explanations,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2016, pp. 109–116.

[13] N. Wang, D. V. Pynadath, E. Rovira, M. J. Barnes, and S. G. Hill, “Is it my looks? Or something I said? The impact of explanations, embodiment, and expectations on trust and performance in human-robot teams,” in International Conference on Persuasive Technology. Springer, 2018, pp. 56–69.

[14] M. Edmonds, F. Gao, H. Liu, X. Xie, S. Qi, B. Rothrock, Y. Zhu, Y. N. Wu, H. Lu, and S.-C. Zhu, “A tale of two explanations: Enhancing human trust by explaining robot behavior,” Science Robotics, vol. 4, no. 37, 2019.

[15] B. F. Malle, How the mind explains behavior: Folk explanations, meaning, and social interaction. MIT Press, 2006.

[16] A. Gatt and E. Krahmer, “Survey of the state of the art in natural language generation: Core tasks, applications and evaluation,” Journal of Artificial Intelligence Research, vol. 61, pp. 65–170, 2018.

[17] J. A. Marvel, S. Bagchi, M. Zimmerman, and B. Antonishek, “Towards effective interface designs for collaborative HRI in manufacturing: Metrics and measures,” ACM Transactions on Human-Robot Interaction (THRI), vol. 9, no. 4, pp. 1–55, 2020.

[18] F. Faul, E. Erdfelder, A. Buchner, and A.-G. Lang, “Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses,” Behavior Research Methods, vol. 41, no. 4, pp. 1149–1160, 2009.