ACM Transactions on Human-Robot Interaction (THRI), 12(1), March 2023

Communicating Missing Causal Information to Explain a Robot’s Past Behavior

Zhao Han and Holly A. Yanco


Abstract

Robots need to explain their behavior to gain trust. Existing research has focused on explaining a robot’s current behavior; it remains unknown, and challenging, how to explain past actions in an environment that may have changed after the robot acted, where moved objects leave critical causal information missing.

We conducted an experiment (N=665) investigating how a robot could help participants infer the missing causal information by replaying the past behavior physically, using verbal explanations, and projecting visual information onto the environment. Participants watched videos of the robot replaying its completion of an integrated mobile kitting task. During the replay, the objects were already gone, so participants needed to infer where an object had been picked, where a ground obstacle had been, and where the object had been placed.

Based on the results, we recommend combining physical replay with speech and projection indicators (Replay-Project-Say) to help people infer all the missing causal information (picking, navigation, and placement) from the robot’s past actions. This condition had the best outcomes on both task-based metrics (effectiveness, efficiency, and confidence) and team-based metrics (workload and trust). If one’s focus is efficiency, we recommend projection markers for navigation inferences and verbal markers for placing inferences.


*This work was completed while Zhao Han was affiliated with the University of Massachusetts Lowell. As of September 2022, he is a Post-Doctoral Fellow at the Colorado School of Mines and can be reached at zhaohan@mines.edu.

This work has been supported in part by the Office of Naval Research (N00014-18-1-2503). Vittoria Santoro and Jenna Parrillo contributed to the replay implementation. Thanks to Dr. Aaron Steinfeld, Dr. Reza Ahmadzadeh and the reviewers for their thoughtful feedback.

Authors’ addresses: Zhao Han, zhao_han@student.uml.edu, University of Massachusetts Lowell, 1 University Ave., Lowell, MA, USA, 01854; Holly A. Yanco, holly@cs.uml.edu, University of Massachusetts Lowell, 1 University Ave., Lowell, MA, USA, 01854.

CCS Concepts: • Computer systems organization → Robotics; • Human-centered computing → Empirical studies in interaction design; Mixed / augmented reality.

Additional Key Words and Phrases: Robot explanation, Behavior explanation, System transparency


Introduction

Current research has focused on investigating in-situ explanations of robots’ current actions, i.e., actions happening at or around the present moment. For example, researchers have recently investigated using images of kitchen cleaning tasks with a description explaining the robot’s current behavior [79]. Other research includes a robot explaining why it undesirably blocked a TV while a person was watching it [76]. Technical approaches include generating explanations of a robot’s current behaviors using function annotations in assembly lines [48], an encoder-decoder model [24], and behavior trees [40].

Yet it remains relatively unknown how a robot should communicate explanations of its past actions. Providing explanations for past behavior is especially interesting yet challenging because the environment might change after the robot’s actions. For example, objects might be moved after manipulation tasks, or obstacles on the floor might be removed by people after navigation tasks (e.g., a wet floor sign removed by cleaning staff once the floor becomes dry). These environment changes lead to missing causal information, which may confuse people when the robot later explains its behavior with references to objects that were only present in the past.

Robots must account for these scenarios and provide indications that help people infer causal information that may no longer be present at explanation time. Indeed, psychology researchers have found that humans hope to gain causal knowledge from others’ explanations [13, 57]. When causal knowledge is present, understanding improves and “people can simulate counterfactual as well as future events under a variety of possible circumstances” [59].

Despite these benefits, displaced objects make it difficult for a robot to explain in such post-hoc scenarios. In the worst case, the displacement might result from a failure in object recognition or manipulation of which the robot is unaware. Because of this unawareness, the robot’s explanations about the key missing causal information may not make sense, due to the inconsistency between what humans observed and what the robot recognized. Beyond this manipulation scenario, a robot navigating in environments where humans move objects, such as a wet floor sign, faces a further challenge: it lacks a human-level understanding of the environment, in which objects may be merely obstacles, feature points without semantic meaning, or categorized only as similar objects because some objects have never been seen before.

1.1 Approach

To expand our knowledge of how a robot can help people infer missing causal information about its past behavior, with these challenging scenarios in mind, we designed three communication modalities: physical replay, verbal markers, and projection markers. We evaluated them in an online experiment conducted through Prolific [70]. Participants watched videos of a Fetch robot [86] replaying its past actions while speaking or projecting key causal information (or both) in a collaborative mobile kitting task consisting of both manipulation (picking and placing) and navigation subtasks.


Figure 1. The mobile manipulation task environment in which we investigated how the robot could provide indicators of missing causal information about its past behavior. The robot is supposed to pick different gearbox parts, including the gearbox bottoms on the table it is facing, take them to the caddy table on the left, and deliver the caddy to the bottom-right table for an assembly worker to assemble a gearbox.

In the task (Figure 1), the Fetch robot is supposed to help a worker assemble gearboxes: the robot picks one of the gearbox bottoms on a table, navigates to a caddy table, and places the gearbox bottom into a caddy. The caddy has three compartments: one big rectangular section and two small square sections (see Figure 1, left). The robot has to put the gearbox bottom into the bigger compartment because the object does not fit into either of the two small compartments. The task was originally designed for the FetchIt Mobile Manipulation Challenge, and there are different parts for the robot to pick. For more details, please see our previous paper [39].

In the experiment, we considered three well-motivated scenarios in which a robot needs to replay its past behavior to help people infer missing information because the robot’s explanations are unexpected – inconsistent with what participants observe. The scenarios below are narrated from the worker’s perspective and were presented to participants throughout the experiment:

(1) Picking Failure Scenario. “As shown below (Figure 1), a robot is helping you to assemble gearboxes. It can drive itself, pick, and place a gearbox bottom into a caddy. One day, you leave your work area for a few minutes. When you return, you notice there are still 2 gearbox bottoms on the table. You ask the robot if it just picked up a gearbox bottom, and it says yes. You are confused, and ask the robot to replay its past behavior. Then you see the robot was grasping a large wood chip torn up from the tabletop.”

“In the video, the robot replays its past behavior: picking up a large wood chip that the robot thought was a gearbox bottom. At replay time, the wood chip is already gone. Can you figure out where the large wood chip was before?”

(2) Navigation Scenario. “One day, another worker nearby told you the robot didn’t go straight to the caddy table. But you see the robot every day, and it does go straight to the caddy table every time. You are confused, and ask the robot to replay its past behavior. Then you see the robot was actually avoiding an obstacle on the ground.”

“In the video, the robot replays its past behavior: driving itself to the caddy table. At replay time, the obstacle on the ground is already gone. Can you figure out where the obstacle on the ground was before?”

(3) Placing Scenario. “One day, you hear a gearbox bottom dropped onto the floor behind the caddy table. You ask the robot if it just put a gearbox bottom into a caddy, and it says yes. You are confused, and ask the robot to replay its past behavior.”

“In the video, the robot replays its past behavior: putting a gearbox bottom into a section of the caddy. At replay time, the gearbox bottom is already gone. Can you figure out which section of the caddy the robot tried to put the gearbox bottom into?”

As seen from the questions, there are three missing pieces of causal information due to environmental change which the robot should indicate at replay time.

(1) For picking, because a large wood chip was recognized as a gearbox bottom and the chip is now gone, the robot needs to indicate where it thought the gearbox bottom was.


Figure 2. The real-world failure that inspired the picking scenario: a Fetch robot misrecognized a torn-up wood chip near the top-right corner of the table as a screw.

This is inspired by a real-world failure, shown in Figure 2, where the robot once misrecognized a wood chip at a table corner as a smaller screw. However, because the screws were placed inside a container and can only easily be seen from a viewpoint near the container (the viewing angle of Figure 2), we used the larger gearbox bottoms, which are placed directly on the table, as shown in Figure 1. Note that the larger gearbox bottoms were also part of the original FetchIt Mobile Manipulation Challenge task.


Figure 3. The original scene where the robot avoided a ground obstacle, the yellow wet floor sign, while navigating to the caddy table on the left. At replay time, the yellow wet floor sign is gone. Key video frames from the replay video without the sign are shown in Figure 6.

(2) For navigation, the robot has to take a seemingly non-optimal navigation path because there was a ground obstacle – a wet floor caution sign – between the two tables during navigation (See Figure 3).

(3) For placing, the robot should indicate which caddy compartment the gearbox bottom was placed into, so that a human can understand that a gearbox bottom does not fit into a small caddy section and therefore slipped and dropped onto the floor.

To help people discover the missing causal information, we manipulated seven methods, combinations of the three communication modalities, that the robot used to indicate relevant actions in the mobile kitting task. Our base condition was a physical movement replay (head, arm, and base movement) in which the robot physically replayed its past actions. We tested this base condition against three communication methods: speech, projection, and both speech and projection. These three were also tested in combination with physical replay.

In a questionnaire, we asked participants whether and when they had inferred the missing information – the locations of the misrecognized object, the ground obstacle, and the section of the caddy the object was placed into. Participants were also asked about their confidence in their inference answers, their mental workload, and their trust in the robot.

1.2 Applicability to Other HRI Applications

To the best of our knowledge, this is the first investigation into explaining key missing causal information of a robot’s past actions. We experimented with a combination of three communication methods: physical replay, speech, and projection mapping [43].

With generalizability in mind, we have situated our experimentation in a mobile manipulation task, so that the results would be potentially applicable to all tasks that have a manipulation or navigation component.

We anticipate our recommendations will be applicable to more domains, complementary to the manufacturing scenario used in this paper. For example, the projection work may be similar to non-verbal eye-gaze cues in the social robotics domain (e.g., [3]). Being both illuminated and directional, projection onto a robot’s environment is more explicit and thus may require less cognitive effort to infer the pointing destination.

Other examples of domains where our results could apply include assistive mobile manipulation, e.g., assisting people with disabilities [21, 47, 50, 52, 68, 85], picking and delivering desired objects (e.g., [14, 51, 55]), and performing household chores (e.g., tidying up [1, 32, 75] and laundry [61]). Compared to eye-gaze, failures in assistive manipulation and mobility can physically damage the operating environment. In cases where the robot is not expected to fail, the three modalities investigated can be used to indicate where the robot will manipulate or navigate, allowing users to decide whether to choose an alternative, or to increase a user’s confidence in the assistive robot technology.

Robot navigation, an integral part of mobile manipulation, also has a wide range of applications, especially with humans present [73]. Examples include security guard robots [4] and robots in public spaces such as the Pepper robot deployed in airports [34], delivery robots on the streets [87], warehouse robots [86], and robot dogs used in industrial settings [82].

While further work will be necessary to validate our findings in other robot domains, we have confidence that such applicability will be found for communicating issues regarding robot manipulation and robot navigation.

2 Related Work

2.1 Robot Explanations

There has been a surge of interest in robot explanations recently, partly due to the need to better interpret the widely used, black-box deep learning models [2, 8, 38, 62, 71].

In the human-robot interaction community, most research has focused on explaining a robot’s current behaviors. Hayes et al. [48] proposed explaining robot controller policies by allowing robot developers to manually annotate functions in code for robots to explain in an automated assembly line. Chakraborti et al. [20] mathematically formulated a set of requirements for robot explanations, such as completeness, conciseness, monotonicity, and computability, to align humans’ mental models with the robots’. Das et al. [24] used an encoder-decoder deep learning model to generate robot failure explanations that consider causal knowledge from the current environment, i.e., the context; a user study showed participants preferred explanations with context. Stange and Kopp [76] investigated robot explanations of undesirable behavior, i.e., moving between people and the TV while they watch TV. Results show that both stating the action and explaining the robot’s intention and need increased understandability and desirability. Zhu and Williams [88] examined robots giving explanations before robot actions; results show that providing these proactive explanations improved trust in robots. Han et al. [40] proposed a set of algorithms using behavior trees to generate hierarchical and failure explanations for a robot’s current behaviors. While recent research [42] showed that non-verbal cues may not be enough for people to fully understand robots, there has been a considerable amount of research in HRI on non-verbal behaviors, such as eye gaze (notably [3, 63]) and legible arm movements (e.g., [31, 56]). For a comprehensive review of non-verbal cues, please refer to [18].

In the human-agent interaction (HAI) community, one notable work was done by Ofra et al. [6], who trained an encoder-decoder model on videos and annotated descriptions to generate explanations for a Pac-Man agent’s turning moves; a human-subjects study [5] showed that participants preferred explanations generated from expert-provided annotations rather than from non-experts’ annotations.

As human explanations [59] have been studied extensively in psychology, psychology researchers have also been contributing to the HRI field, focusing on how people perceive robot explanations, or on ethics, rather than on implementing explanations. De Graaf and Malle [25] summarized how humans explain behavior and proposed that robots should generate explanations within the conceptual and linguistic framework of human behavior explanation. The same authors [26] later published results from an experiment showing text explanations to participants, which confirmed their proposal with slight differences, one of which is that people think of robots as rational rather than motivational-affective entities like humans.

The existing work assumes that the robot explains its behavior as it is executing it. In this work, we focus on explaining missing causal information from a robot’s past actions. This presents new challenges because the objects, or the context, referenced in the robot’s explanation are no longer complete.

Beyond individual papers, robot explanations have also been the focus of workshops, a journal special issue, and a symposium theme within the human-robot interaction community. At the 2018 ACM/IEEE International Conference on Human-Robot Interaction (HRI), the workshop “Explainable Robotic Systems” [27] was held. At the 2020 HRI conference, the workshop “Assessing, Explaining, and Conveying Robot Proficiency for Human-Robot Teaming” [77] was held. The 2020 AAAI Artificial Intelligence for Human-Robot Interaction Symposium (AI-HRI) [9] set “Trust & Explainability” as its theme. The ACM Transactions on Human-Robot Interaction journal published a special issue [28] on “Explainable Robotic Systems” in July 2021.

2.2 Observational Learning

To inform our investigation of how a robot can help people infer missing causal information about its past behavior, we looked for inspiration in the observational learning field of psychology. The field focuses on human cognitive evaluations of observed human action sequences, similar to the robot task sequence methods (i.e., action sequence representations [66]) studied in robotics and human-robot interaction. In Section 4.7 on our implementation, we used Behavior Trees to represent the robot’s action sequence and, in Study Condition 2, discussed how the causal communication selected at specific actions is inspired by findings in observational learning.

Observational learning itself is defined as being “concerned with the acquisition of attitudes, values, and styles of thinking and behaving through observation of the examples provided by others” [11]. Applied to robots that communicate explanations to humans, it can give insight into which portions of an action sequence a robot should choose to communicate, so that others understand the most important information the robot attempts to convey and retain a stronger impression.

One subfield of observational learning studies how children imitate action sequences, i.e., behaving, from others [16]. There are two schools of thought: (1) children are rational learners (rational imitation [67]) who reproduce or imitate the most salient causal actions leading to an outcome; (2) children tend to overimitate others [49, 58], copying unnecessary, non-causal yet normative behavior in a social and cultural setting [53].

For over-imitation, it has been validated that children and adults both over-imitate a stranger’s actions even when they are not aware of participating in an experiment [83]. It is worth noting that the experiment was conducted in a real-world setting to avoid the adult participants being sensitive to a laboratory environment [17].

The first school of thought, the rational-imitation paradigm, has also been studied with adult participants. Buttelmann et al. [17] replicated the head-touch demonstration experiment with eye-tracking equipment, in which a model touches a lamp with the head when the hands are occupied versus with the hands when they are free. The results show no significant differences between adults and infants in attention, measured by the amount of time spent looking at the modeled action. The only significant difference is that adults looked at the model’s head in the video demonstrations significantly longer than infants did (around 1 minute and 20 seconds), while no difference was found for looking at the torso (only a few seconds).

In our work, we are particularly interested in the first paradigm, rational-imitation, partly because a recent study [74] found that children overimitate robots less than humans due to the lack of social motivations.

On the model side, researchers showed that the demonstrator’s intentional and knowledge states are used to aid causal inference [35]. In [35], results suggested that intentional actions with verbal markers (e.g., “There”) lead children to assume that the model acted purposefully in order to reach a goal. Thus the intentional actions help observers understand causality. Inspired by this, we incorporated verbal markers into some of our conditions.

Regarding video studies, televised models showing an action sequence are not rare and have been shown to be effective at an early age. In one study [65], 10- and 12-month-old infants were able to learn and imitate negative and positive emotional reactions from a televised model. Another study showed that 12-month-olds were also able to perform rational head-touch imitation in constrained conditions (i.e., hands occupied), compared to hand-touch, also from a televised model [89]. Around the same period, researchers found that 2-year-old toddlers could learn verbal labeling during an action sequence with repetition from television after a 24-hour delay, and both video parent labeling and video voice-over labeling did not differ from live parent labeling [12]. In another televised experiment, results show similar findings that toddlers can use acoustical action effects from both a live model and a televised model [54]. This line of work suggests that an action sequence in a video accompanied by acoustical action effects should be on par with a live demonstration. Inspired by the acoustical effect, which we interpret as salient, we also explored the projection mapping method in addition to verbal markers, because projection mapping is not a communication method that humans use and projections can be seen as salient.

3 Hypotheses

Driven by the observational learning work in Section 2.2, particularly the finding that a demonstrator’s intention aids causal inference, we formulated the following six hypotheses. The first three concern task-based metrics and explore the effectiveness and efficiency of the verbal and projection markers for discovering missing causal information about past behavior. The remaining three concern team-oriented subjective workload and trust metrics, exploring how people perceive the markers that help inference-making.

3.1 Task-Based Metrics

Hypothesis 1 (H1) – Effective causal inference with verbal markers; Adding verbal markers to relevant actions in a robot’s action sequence will help people effectively infer the causality of the robot’s behavior, measured by subjective responses about the three pieces of missing causal information: where the gearbox bottom was, why the robot did not choose the straight path, and which caddy compartment the gearbox bottom was placed into. This hypothesis also seeks to validate the human intentionality study in [35], which found that intentional actions with verbal markers, such as “here” and “there”, help observers understand causality.

Hypothesis 2 (H2) – The same effectiveness of causal inference with projection markers vs. verbal markers; Adding projection markers to relevant actions will be at least as effective as verbal markers for inferring the missing causal information. The measure for this hypothesis is the same as H1.

Hypothesis 3 (H3) – Faster causal inference with projection markers; Adding projection markers will make causal inference faster than verbal markers, measured by when participants found the three pieces of missing information. As projection provides the causal information directly in the robot’s operating environment, we expect projection to be at least on par with verbal markers for causal inference.

3.2 Team-Based Metrics

Hypothesis 4 (H4) – The same workload in both verbal and projection conditions; Projection markers will impose the same amount of workload for the inference tasks as verbal markers, measured by subjective measures.

Hypothesis 5 (H5) – A robot is more trustworthy with projection markers; Because projection markers have the potential to support faster inference of missing causal information, we believe people will trust the robot more, measured by subjective measures.

Hypothesis 6 (H6) – There will be less workload when both verbal and projection markers are present; Inspired by the multiple resource theory of Wickens [84], we believe that when more channels are used to convey the causal information, inference becomes easier and thus requires less workload. The measures will be subjective and the same as H4.

4 Experiment Design

The experiment followed a between-subjects design. In each condition, different participants watched three videos of a mobile manipulation task and completed a survey.

4.1 Task

In the videos, the robot replayed three subtasks in a mobile kitting task: it first tried to pick a gearbox bottom, navigated to the caddy station, and placed the gearbox bottom into a caddy.

As mentioned earlier, the robot navigated along a detour rather than a straight path during the replay because there had been a wet floor caution sign along the way. Additionally, the robot replayed its gearbox bottom grasp without any gearbox bottom in its hand because it had treated a large wood chip torn from the table as a gearbox bottom; the large wood chip was already gone at replay time. In addition to these two pieces of missing causal information, because no gearbox bottom was physically placed into a caddy compartment, the compartment position was also missing.

At replay time, we used verbal markers and projection markers to indicate the missing causal information: which gearbox bottom the robot was grasping, why the robot detoured during navigation, and which caddy compartment the gearbox bottom was placed into.

The mobile kitting task was performed in an enclosed arena. For participants to see the whole arena and the ground obstacle projection in the videos during the picking and navigation replays, we set up a camcorder on a tall tripod sitting on a table at the near right corner outside of the arena. Specifically, the distance from the lens of the camcorder to the floor was around 1.9 meters (7 feet 8.5 inches). The placement was intentional to cover the wide field of view of a human’s eyes. In the placing replay, we placed the camcorder to the left of the robot outside of the arena to get closer to the caddy table.

4.2 Study Conditions

Table 1. Seven conditions with or without different communication strategies (Section 4.2)

                   Baseline    With speech    With projection    With projection & speech
With replay¹       Replay      Replay-Say     Replay-Project     Replay-Project-Say
Without replay¹                Say            Project            Project-Say

¹The conditions with replay will be referred to as replay conditions, while those without replay will be referred to as non-replay conditions. Replay means physical arm or base movement replay.

There are seven conditions in this experiment (Table 1), designed to show different approaches to indicating missing causal information during the replay of the robot’s past actions. As we describe each condition below, Figures 4 to 8 show key video frames from the videos of the conditions.



(1) Replay. In this condition, the robot replayed all actions in the task’s action sequence, i.e., all head, arm, and wheel movements, without any verbal or projection indication. No explicit indications of causality were expressed by the robot, making this the baseline condition. We also include it because it closely resembles introspection, which requires humans to investigate thoroughly.

Table 2. Causal Verbal Markers and Their Timing (See Study Condition 2 for the Rationale)

  • Picking – Speech: “Ok. I picked up a gearbox bottom from here.” Timing*: robot’s gripper is over the target object (third photo in Figure 4).
  • Navigation – Speech: “Ok. I didn’t go straight to the caddy table because there was something on the floor in front of me on my left.” Timing*: before the robot starts driving itself (first photo in Figure 6).
  • Placing – Speech: “Ok. I placed the gearbox bottom into the near right section of the caddy.” Timing*: robot’s gripper is over the compartment (second photo in Figure 7).

*Timing applies to the Replay-Say and Replay-Project-Say conditions. Because the Say and Project-Say conditions do not have any physical arm and base replay, the robot spoke right after its perception actions, either for tabletop or ground objects.

(2) Replay-Say. In addition to replaying all actions, the robot speaks about the missing causal information during the relevant actions to indicate the location of the gearbox bottom, the ground obstacle, and the caddy compartment, as shown in Table 2. The earliest relevant actions were chosen to avoid ambiguity. We chose simple words in the speech so that participants could easily understand it.

To activate verbal markers, we used task-relevant heuristics to indicate the causal information and maximize influence on the attention process, which is one of the four serial processes that observational learning depends on [10]. These heuristics are informed by research on instructional features [72] in observational learning and are commonly known as learning points (also referred to as codes) [78]. The heuristic used for picking and placing is to speak when the robot’s gripper reaches a point right above the manipulation target, to both attract attention and avoid ambiguity. Specifically, the robot spoke, “I picked up a gearbox bottom from here”, to indicate where it picked. To indicate where it placed, it spoke, “Ok. I placed the gearbox bottom into the near right section of the caddy.” Note that this gives more specific and direct information because the robot has to be unambiguous in informing participants which of the caddy’s three sections it placed into. For navigation, the robot speaks right after a navigation plan is computed and its base is about to start moving. All the robot speeches are listed in Table 2.

Note that the three speeches are not meant to be directly comparable, but rather to reach the same level of unambiguity and attention-drawing. They were chosen to give different missing causal information in the three tasks, which involve different sets of objects: picking an object on a table, navigating around an obstacle on the ground, and placing an object into one section of a three-section caddy. It was also our hope to see whether an inference result from one task would transfer to another task. We will discuss the influence this may have had on the different results.

The speech was generated using Google Cloud Text-to-Speech: WaveNet [69], specifically en-US-Wavenet-D. We lowered the speech speed to 85%, as suggested by [60], to counter the noise from the air conditioning system in the ceiling of our facility. Because the volume of the Fetch robot’s base speakers was very low, we bought a 20W JBL FLIP 5 Bluetooth speaker and placed it behind the robot’s neck. We chose a white one to match the Fetch robot’s primary color, making the speaker less noticeable.
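As a concrete illustration, below is a minimal sketch of how such an utterance could be synthesized, assuming the google-cloud-texttospeech Python client; the paper does not specify its exact synthesis pipeline, so the output file name and audio encoding are illustrative choices.

```python
# Minimal sketch: synthesize one verbal marker with the WaveNet voice en-US-Wavenet-D
# at 85% speaking rate (illustrative; not necessarily the authors' exact pipeline).
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Ok. I picked up a gearbox bottom from here."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Wavenet-D"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        speaking_rate=0.85,  # slowed to 85% to counter ambient air-conditioning noise
    ),
)
with open("picking_marker.wav", "wb") as f:
    f.write(response.audio_content)  # played back through the external Bluetooth speaker
```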






(3) Replay-Project. The robot replayed all actions and, instead of speaking, projected the causal information during the relevant actions: during picking, the perception result and the manipulation target were projected back onto the operating environment (Figure 5); during navigation, arrows showing its navigation path as well as a 2D projection of multiple spheres representing the laser scans of the ground obstacle (Figure 6); and during placing, a cubic projection representing the space that a caddy compartment occupies (Figure 8).

This condition is interesting because projection is not a human capability, and we wanted to explore the use of projection mapping as a more salient cue, inspired by the action effects discussed in Section 2.2. As stated in Hypothesis 2, we expect it to have the same effect as the verbal markers.

(4) Replay-Project-Say. This condition includes both verbal and projection indicators. The combination is inspired by our previous study [42], which showed that verbal explanations are needed in addition to non-verbal cues. It is also inspired by Multiple Resource Theory [84], which states that a cross-modal interface, using modalities that reside in two different channels, has advantages over an intra-modal interface, using modalities that reside in the same channel.

The first four conditions above have physical arm and base movement replay. The remaining three below do not have replay, to avoid replay being a confounding factor affecting participants’ responses.

(5) Say. This condition uses only verbal markers to indicate the gearbox bottom, the ground obstacle, and the caddy compartment. It is the same as the Replay-Say condition (Condition 2) except without arm or base movements.

(6) Project. This condition uses only projection mapping to project the perception result back onto the operating environment as the indication. It is the same as the Replay-Project condition (Condition 3) except without arm or base movements.

(7) Project-Say. This condition combines both verbal and projection indicators but without arm or base movements.

All the causal verbal and projection markers are identical across conditions whenever they are present. The 21 videos, three scenarios (picking, navigation, and placement) by seven conditions, are available on the authors’ website. In the survey, however, the videos were shown as embedded YouTube videos to avoid excessive buffering for participants on other continents, particularly in Europe.

4.3 Questionnaire

To test the hypotheses, we asked participants to fill out a questionnaire with 7-point Likert-scale, free-form, and forced-choice questions. We list the questions below because they and their choices are important details for understanding the results section, especially the context of the figures. We also believe this section will facilitate potential replication efforts. The questionnaire allowed us to measure

(1) six subscales on causality inference – effectiveness, efficiency through timing, and confidence,

(2) five subscales on task workload, adapted from the NASA Task Load Index [46],

(3) and four subscales on trust, using the Muir Trust scale [64].

4.3.1 Causality Inference Effectiveness, Efficiency & Confidence

These causality inference questions were asked after participants watched the corresponding video. Participants in every condition answered all the questions below.


Figure 9. Photo shown to participants to answer where the robot picked. The correct answer is “F”.

Manipulation Causality Inference & Confidence. “Where was the large wood chip that the robot tried to grasp before?” Figure 9 was shown. Options are “I don’t know” and seven choices from A to G. The correct answer is “F”.

Timing of Manipulation Causality Inference. – “When did you know the answer to the question “Where was the large wood chip that the robot tried to grasp”?” The choices are: “I never knew”, “Before its head started moving around”, “While its head was moving around”, “After its head stopped moving”, “Before its arm started moving”, “When its arm started moving”, “When its hand was over the table”, “When its hand was very close to the table before grasping”, “When it was grasping”, and “Other (Please elaborate)”.


Figure 10. To answer where the ground obstacle was, this photo was shown to participants. The correct answer is “Area D”.

Navigation Causality Inference. – “Which grid section was most occupied by the obstacle that the robot was trying to avoid before?” Figure 10 was shown. The options are “I don’t know”, “Area A”, “Area B”, “Area C”, “Area D”, “Area E”, and “Area F”. The correct answer is “Area D”.

Timing of Navigation Causality Inference. – “When did you know where the obstacle was?” The choices are “Before the robot started moving”, “When the robot was facing area A”, “While the robot was moving towards the caddy table in grid E”, “While the robot was moving towards the caddy table in grid F”, “While the robot was moving towards the caddy table in grid C”, “When the robot was in front of the caddy table”, and “Other (Please elaborate)”.


Figure 11. To answer which section of the caddy the robot placed into, this photo was shown to participants. The correct answer is “Section A”.

Placement Causality Inference. – “Which section of the caddy did the robot place the gearbox bottom into before?” A caddy photo with a labeled compartment is shown in Figure 11. Response choices include “Section A”, “Section B”, “Section C”, and “I don’t know”. The correct answer is “Section A”.

Timing for Placement Causality Inference. – “When did you know where the robot put the gearbox bottom into?” The options are “I never knew”, “Before its head started moving around”, “While its head was moving around”, “After its head stopped moving”, “When it started moving its arm”, “When its hand was over the caddy”, “When its hand was very close to the caddy before releasing the gearbox bottom”, “When it was releasing the gearbox bottom”, and “Other (Please elaborate)”.

4.3.2 Task Workload Measures

NASA Task Load Index. – We adapted the NASA Task Load Index (NASA-TLX) [46] multidimensional scale to estimate workload for the inference tasks, which could be cognitively demanding. Specifically, we adopted the subscales of mental demand, temporal demand, performance, effort, and frustration level. We removed the physical demand subscale because this study was conducted virtually and required little physical activity; participants were allowed to finish the whole study and watch the videos in a hassle-free manner at their own pace. The subscales and their response options are listed below:

  • Mental Demand – “How much mental and perceptual activity was required to answer questions after watching the videos (e.g. thinking, deciding, calculating, remembering, looking, searching, etc)? Was the task easy or demanding, simple or complex, exacting or forgiving?” The options range from very low to very high.
  • Temporal Demand – “How much time pressure did you feel due to the rate of pace at which the tasks or task elements occurred? Was the pace slow and leisurely or rapid and frantic?” The options range from very low to very high.
  • Performance – “How successful do you think you were in answering the questions after watching each video?” The options range from very good to very poor (responses are reversed to be consistent with others).
  • Effort – “How hard did you have to work (mentally and physically) to accomplish your level of performance?” The options range from very low to very high.
  • Frustration Level – “How insecure, discouraged, irritated, stressed and annoyed versus secure, gratified, content, relaxed and complacent did you feel during the task?” The options range from very low to very high.

4.3.3 Trust Measures

Muir Trust scale. – We used the composite trust score by Muir [64] to test our trust hypothesis, H5. The Muir trust score is well-established and has been used widely in the HRI and robotics literature on the trust topic, including [7, 29, 30, 80, 81]. The subscales and their options are:

  • Predictability – “To what extent can the robot’s behavior be predicted from moment to moment?” The options are not at all, mostly not, somewhat not, neutral, somewhat, mostly, and completely.
  • Reliability – “To what extent can you count on the system to do its job?” The options range from very low to very high.
  • Competence – “What degree of faith do you have that the robot will be able to cope with similar situations in the future?” The options range from very low to very high.
  • Trust – “Overall, how much do you trust the robot?” The options range from very untrustworthy to very trustworthy.

4.4 Quality Assurance Questions

Finally, we asked several attention check questions to help us ensure participant attention to the experimental stimuli, similar to those used in Brooks et al. [15]. The questions are:

  • After watching picking videos – What is the color of the gearbox bottoms? The choices are “Blue”, “Red”, “Green”, and “Gray”. The correct answer is “Gray”.
  • After watching navigation videos – What is the color of the robot? The choices are “Mostly white”, “Mostly red”, “Mostly yellow”, and “Mostly green”. The correct answer is “Mostly white”.
  • After watching placing videos – How many robot(s) were in the video? The choices are from 0 to 3. The correct answer is “1”.

The choices for these attention check questions were displayed in random order. In addition, we added a Google reCAPTCHA verification question at the beginning of the survey to avoid bots, i.e., scripts that automate question answering.

4.5 Procedures

The study was conducted on Prolific, a platform similar to Amazon Mechanical Turk for online participant recruitment. Participants entered the study via an anonymous link to a Qualtrics survey. Once started, participants were presented with informed consent information and the Google reCAPTCHA verification. After agreeing to participate and passing the verification, participants answered demographic questions and were randomly assigned to one of the experimental conditions.

Before watching each video, participants were presented with the motivating scenarios and the prompt questions, as seen in the Introduction section. Then they watched the videos, embedded from YouTube, and answered questions on the same page. We also gave the YouTube links that open in a new tab to deal with potential technical difficulties; the text is “If the video doesn’t load, please click this YouTube link (The YouTube link is a hyperlink). It will open in a new tab/window”.

For the conditions whose videos have sound, we first showed a YouTube video with sound only and asked what participants heard, to ensure they could hear the sound. This video was presented on a separate webpage because we found that YouTube remembers a viewer’s mute preference, which can cause other videos on the same webpage to remain muted even after the sound-only video is manually unmuted.

After watching all videos and answering the relevant questions, participants were asked to answer the trust and NASA Task Load Index questionnaires to finish the study. To be recorded as complete, participants were redirected back to Prolific at the end.

The entire study took an average of 13.6 minutes to complete, with a median of 11.6 minutes. All participants were paid US $3.01, based on an hourly rate of $9.50 and the 19-minute completion time estimated by the experimenter before the study. The study was approved by the institutional review board (IRB) at the University of Massachusetts Lowell in the USA.

4.6 Power Analysis, Participants, and Participants Recruitment

We used G*Power 3.1.9.7 [33] to perform two a priori power analyses because we planned to run two types of hypothesis tests.

We first performed an a priori power analysis for “Goodness-of-fit tests: Contingency tables”. The parameters were: effect size w = 0.5 (large), 𝛼 error probability = 0.05, power (1 – 𝛽 error probability) = 0.95, and df = 9, which reflected the number of fixed choices in our measures described in Section 4.3. The output parameters in G*Power showed the sample size needed to reach the desired power of 1 − 𝛽 = 0.95 for a single goodness-of-fit test. Thus, for the seven conditions of our experiment design, we needed at least 7 × 95 = 665 participants.

We also performed an a priori power analysis for “ANOVA: Fixed effects, omnibus, one-way” tests. The parameters were: Effect size f = 0.4 for large effect size, 𝛼 error probability = 0.05, Power (1 – 𝛽 error probability) = 0.95, and Number of groups = 7, reflecting the number of independent conditions in our study. The output parameters showed that the total sample size needed was 140.

Thus our study would need approximately 𝑁 = 665 participants to be sufficiently powered for both types of statistical tests.
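For readers without access to G*Power, the sketch below reproduces the two a priori analyses in Python with statsmodels instead; because the implementation differs, the computed sample sizes may deviate slightly from G*Power’s output.

```python
# Sketch of the two a priori power analyses using statsmodels instead of G*Power.
from statsmodels.stats.power import GofChisquarePower, FTestAnovaPower

# Chi-square goodness-of-fit: w = 0.5, alpha = 0.05, power = 0.95, df = 9 (10 bins).
n_gof = GofChisquarePower().solve_power(effect_size=0.5, alpha=0.05, power=0.95, n_bins=10)

# One-way ANOVA: f = 0.4, alpha = 0.05, power = 0.95, 7 groups (returns total sample size).
n_anova = FTestAnovaPower().solve_power(effect_size=0.4, alpha=0.05, power=0.95, k_groups=7)

print(round(n_gof), round(n_anova))
```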

Using Prolific, we recruited a total of 691 participants; only 25 (3.6%) of them failed the quality check questions. The randomizer feature in Qualtrics was used to ensure evenly distributed condition assignment. This resulted in 666 valid cases, with one extra participant in the Replay-Project condition, which might have been caused by a timed-out participant (Prolific automatically set a 65-minute timeout threshold based on the 19-minute estimated completion time; see Prolific support: How long will your study take to complete?) being replaced by Prolific with another participant assigned to one of the other conditions. To ensure an equal number of participants in each of the seven conditions, we trimmed the data from the extra participant. This procedure resulted in a sample size of N = 665, with 95 participants in each of the seven between-subjects conditions.

The final sample of 665 participants includes 267 females, 391 males, 5 non-binary people, and 1 transgender person. Their ages range from 19 to 88 (M = 31.8, median = 28.0).

Specified qualifications for participation on Prolific included being over 18 years old, being fluent in English (which provided a reasonable assumption of English language comprehension), having taken part in 100 – 10,000 studies on Prolific, and having a 100% approval rating (Prolific uses the upper bound of the 95% confidence interval to calculate the approval rate). Each participant, whether or not they passed the data quality assurance checks, was paid for their participation, although, as noted above, the data from people who failed the checks were removed from our analysis.

4.7 Implementation

The action sequence in the replay is implemented using ROS and Behavior Trees [22]. For more details on how we modeled the mobile kitting task in Behavior Trees, please see [40].
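As a rough sketch of this kind of representation (not the authors’ actual trees, which are built on ROS and detailed in [40]), a memory Sequence over the three subtasks could look like the following, assuming the py_trees library; the leaf class and subtask names are illustrative placeholders.

```python
# Illustrative Behavior Tree skeleton for the mobile kitting sequence (py_trees).
import py_trees

class RobotAction(py_trees.behaviour.Behaviour):
    """Placeholder leaf; a real leaf would wrap a ROS action (pick, navigate, place)."""
    def update(self):
        # A real implementation would return RUNNING until the ROS action finishes.
        return py_trees.common.Status.SUCCESS

root = py_trees.composites.Sequence(name="MobileKitting", memory=True)
root.add_children([
    RobotAction("PickGearboxBottom"),
    RobotAction("NavigateToCaddyTable"),
    RobotAction("PlaceIntoCaddy"),
])

py_trees.trees.BehaviourTree(root).tick()  # ticks the children in order
```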

For the replay, we recorded the relevant ROS topics, including arm movement, neck and eye camera movement, and wheel movement, to a MongoDB database, chosen for its schemaless storage (no need to create a table for each ROS topic data type) and its querying capabilities. At replay time, these topics are queried from the database and streamed back into ROS to exactly replicate the movement that happened at record time.
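Below is a minimal sketch (not the authors’ exact implementation, which is described in [44]) of what recording a single ROS topic to MongoDB and streaming it back might look like; the topic, database, and collection names are hypothetical.

```python
#!/usr/bin/env python
# Minimal record-and-replay sketch for one ROS topic via MongoDB (names are hypothetical).
import rospy
from pymongo import MongoClient
from rospy_message_converter import message_converter
from trajectory_msgs.msg import JointTrajectory

collection = MongoClient("localhost", 27017)["replay_db"]["arm_commands"]

def record_cb(msg):
    # Schemaless insert: no per-message-type schema needs to be defined up front.
    doc = message_converter.convert_ros_message_to_dictionary(msg)
    doc["_recv_time"] = rospy.get_time()
    collection.insert_one(doc)

def record():
    rospy.init_node("trajectory_recorder")
    rospy.Subscriber("/arm_controller/command", JointTrajectory, record_cb)
    rospy.spin()

def replay():
    rospy.init_node("trajectory_replayer")
    pub = rospy.Publisher("/arm_controller/command", JointTrajectory, queue_size=10)
    prev_time = None
    for doc in collection.find().sort("_recv_time", 1):
        t = doc.pop("_recv_time")
        doc.pop("_id", None)
        if prev_time is not None:
            rospy.sleep(t - prev_time)  # preserve the original inter-message timing
        prev_time = t
        pub.publish(message_converter.convert_dictionary_to_ros_message(
            "trajectory_msgs/JointTrajectory", doc))
```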

As projection markers are used to show perception, manipulation, and navigation intents, the timing choice is rather simple and static in terms of implementation. The robot is programmed to project right after objects are recognized, immediately after the object target to be grasped is determined, and right after a navigation plan is computed and its base is about to start moving.

To activate verbal markers, we used the task-relevant heuristics to maximize influence on the attention process, as previously discussed in Study Condition 2.

The heuristic used for picking and placing is to speak when the robot’s gripper is above the manipulation target to both attract attention and avoid ambiguity. For navigation, the robot speaks right after a navigation plan is computed and its base is about to start moving. All the robot speeches are listed in Table 2.

A comprehensive account can be found in [44]. The projection mapping implementation is detailed in [43] (tabletop projection) and [41] (navigation path projection).
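Those papers give the full details; as a rough illustration of the general idea behind projector-based indication, the sketch below maps a marker at a known 3D location (e.g., where the ground obstacle used to be) to projector pixel coordinates, treating the projector as an inverse camera. All calibration values and names here are placeholders rather than the authors’ parameters.

```python
# Rough illustration: map a 3D world point to projector pixels and draw a marker there.
import cv2
import numpy as np

# Placeholder projector intrinsics and world-to-projector extrinsics (from calibration).
K = np.array([[1400.0, 0.0, 960.0],
              [0.0, 1400.0, 540.0],
              [0.0, 0.0, 1.0]])
rvec = np.zeros(3)                 # placeholder rotation (Rodrigues vector)
tvec = np.array([0.0, 0.0, 1.5])   # placeholder translation (meters)
dist = np.zeros(5)                 # assume negligible lens distortion

# 3D point where the missing ground obstacle used to be, in the calibrated world frame.
obstacle_xyz = np.array([[0.8, -0.3, 0.0]])

pixels, _ = cv2.projectPoints(obstacle_xyz, rvec, tvec, K, dist)
u, v = (int(round(c)) for c in pixels[0, 0])

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)          # image sent to the projector
cv2.circle(frame, (u, v), 40, (0, 0, 255), thickness=-1)   # filled marker at that spot
```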

5 Results

We used R to analyze the data. As we have seven conditions and a considerable number of choices in our multiple-choice questions, we annotated the relevant figures with pairwise statistical test results, including p values and significance levels. One to four asterisks (*, **, ***, and ****) indicate p < 0.05, p < 0.01, p < 0.001, and p < 0.0001, respectively. The abbreviation n.s. denotes “not statistically significant”. When discussing pairwise results, all statistics included are statistically significant; any non-significant results discussed are explicitly noted as such.

For 7-point Likert responses, we coded them from -3 to 3, from the least to the greatest extent. For example, “very unsure” would be coded -3 while “very confident” would be coded 3.

This section is hypothesis-oriented, and we structure each hypothesis discussion by the three different tasks, picking, navigation, and placement. For readers who would prefer to jump to summarized recommendations for specific subtasks, please see Section 6.6.

5.1 H1. Effective causal inference with verbal markers (partial support)

To see whether verbal markers are effective for aiding participants in causal inference, we first analyze the responses to the inference questions for picking, navigating, and placing subtasks.

For all inference types, including picking, navigation, and placing, proportion tests revealed statistically significant results. We first ran chi-square goodness-of-fit tests on the conditions and the responses to all the multiple-choice inference questions, which revealed statistical significance (p < 0.0001) for each type of inference. Post-hoc binomial tests with Holm-Bonferroni correction for pairwise comparisons were then performed and also revealed significant differences for each type.
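The analysis itself was done in R; the following is an equivalent sketch in Python (scipy and statsmodels) with hypothetical response counts, assuming a uniform expected distribution for the goodness-of-fit test and a chance probability of 1/k for each post-hoc binomial test.

```python
# Sketch of the test procedure: goodness-of-fit over one condition's response counts,
# then per-choice exact binomial tests with Holm correction (counts are hypothetical).
from scipy.stats import chisquare, binomtest
from statsmodels.stats.multitest import multipletests

counts = [2, 1, 0, 1, 3, 1, 85, 2]   # e.g., "I don't know", A, B, C, D, E, F, G
n = sum(counts)

print(chisquare(counts))             # goodness-of-fit against a uniform distribution

p_values = [binomtest(k, n, p=1 / len(counts)).pvalue for k in counts]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(list(zip(p_adjusted, reject)))
```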


Figure 12. Manipulation inference responses. “F” is correct. Replay conditions perform the best: nearly all participants were correct. Half of participants wrongly selected the nearby E in Say. Project and Project-Say had only about half of participants correct. Significant differences are revealed by chi-square goodness-of-fit tests.
Figure 13. Navigation inference responses. The correct answer is “Area D”. Most participants were correct in all conditions except the Say condition, in which only 40% were correct and half selected the nearby Area E. Significant differences are revealed by chi-square goodness-of-fit tests.
Figure 14. Placement inference responses. “Section A” is the correct answer. Around 60% of participants inferred correctly, except in Project, for which no statistically significant results were found.

5.1.1 Picking Inference

For the picking inference responses in Figure 12, there are statistically significant differences in all responses across all replay conditions (the top four subfigures), where almost all participants correctly inferred that F is where the robot had picked.

In the Say condition, statistically significant differences were found for choices A, C, E, and G. Around half of the participants (47, 49.5%) selected the nearby E, while the correct answer, F, was not selected at a significant rate. For the other two non-replay conditions, Project and Project-Say, around half of the participants made the correct inference (48 participants, 50.5%, for Project and 53, 55.8%, for Project-Say).

These results suggest that, without the physical replay of the head and arm movements, participants had difficulty with the picking inference. In particular, with verbal indications alone, participants had an even harder time inferring the correct picking location, F, because they chose the nearby E.

5.1.2 Navigation Inference

For the navigation inference responses in Figure 13, there are statistically significant differences in almost all responses across all conditions. Except for the Say condition, 78%–93% of participants were able to infer correctly that Area D was where the ground obstacle had been (Replay: 79 participants, 83.2%; Replay-Say: 75, 78.9%; Replay-Project: 88, 92.6%; Replay-Proj.-Say: 87, 91.6%; Project: 89, 93.7%; Project-Say: 79, 83.2%). In the Say condition, only 40.0% of participants inferred correctly, while 49.5% chose the nearby Area E.

Similar to the picking inference, half of the participants may have interpreted “right” in the robot’s indication of the ground obstacle location as far right, where Area E is.

5.1.3 Placement Inference

For the placement inference responses in Figure 14, we found statistically significant results in almost all choices (except Section B) across all conditions except the Project condition. Around 60% of participants inferred correctly in all conditions except the Project condition (Replay: 56 participants, 58.9%; Replay-Say: 57, 60.0%; Replay-Project: 57, 60.0%; Replay-Proj.-Say: 64, 67.4%; Say: 54, 56.8%; Project-Say: 58, 61.1%).

The responses of Section B may have occurred by chance in all conditions except the Replay-Say condition, where there is only weak support (p < 0.05).

In the Project condition, unfortunately, all choice responses may have occurred by chance (n.s.).

The placement inference results thus support the effectiveness of verbal indications.

5.1.4 Conclusion

From the analysis of the responses in the Say condition across the three scenarios above, we conclude that H1 on effective causal inference with verbal markers is only partially supported: it is supported in the placement scenario, but not in the picking and navigation scenarios.

With only verbal markers to infer the picking location, half of the participants wrongly inferred the nearby location E rather than the correct location F (see Section 5.1.1 and Figure 12, fifth row). In the placement inference, around 60% could infer where the robot placed the gearbox bottom (see Section 5.1.3 and Figure 14, fifth row, green bar). For the inference of where the ground obstacle was with only verbal markers (see Section 5.1.2 and Figure 13, fifth row), half of the participants, similar to the picking scenario, inferred the wrong location of Area E.

5.2 H2. Effective causal inference with projection markers (partial support)

As we explore the answer to H1, we can also test whether projection markers are effective.

For picking inference, shown in Figure 12, projection indications alone (Project) and projection with verbal indications (Project-Say) were only about half as effective as the conditions with physical replay. However, as discussed for H1, projection indicators are more effective than verbal indicators when they are the only cues presented by the robot, because half of the participants were wrong in the verbal-only Say condition.

In terms of navigation inference, shown in Figure 13, all conditions with projection indications (Replay-Project, Project, Project-Say) are remarkably effective: the majority of participants were able to infer the location of the ground obstacle. Compared to verbal indicators alone (40.0% correct), projection indicators alone were more than twice as effective (93.7% correct).

Regarding placement inference, shown in Figure 14, projection indicators with either physical replay (Replay-Project) or verbal indicators (Project-Say) are at least as effective as the other non-projection conditions. However, all responses to the placement inference question under the Project condition may have occurred by chance (n.s.), so we cannot conclude the effectiveness of projection indicators alone.

Thus, H2 is partially supported. Conditions with projection indicators are remarkably effective for inferring the location of the ground obstacle but only about half as effective for picking inference when projection is the only indicator present. With non-significant results in the placement inference responses for Project, the effectiveness of projection alone there remains unknown.

5.3 H3 – Efficiency. Faster causal inference with projection markers (partial support)

To check whether projection markers lead to faster causal inference, we analyzed participants’ responses to the timing questions. We did not ask this type of question for the Project condition in the navigation scenario because the ground projection was on throughout that video, so no earlier or later events were present. However, we can still gain insight into whether projection indicators alone had an effect by analyzing the responses of participants who experienced the Project-Say condition’s navigation video.

5.3.1 Picking Inference Timing


Figure 15. Participants’ responses to when they inferred the picking location. The Say condition (fifth row) performs the best, with 60+ participants inferring early but 20+ never knowing. Project and Project-Say (last two rows) are in a second tier, with fewer participants inferring early and more who never knew. In all replay conditions (the top four in the figure), participants reported that they knew at a later event.

For picking inference timing, we first conducted a chi-square goodness-of-fit test on the responses to the picking inference timing question across all conditions, which revealed a statistically significant result (𝑝 < 0.0001). We then ran post-hoc binomial tests with Holm-Bonferroni correction for pairwise comparisons; the results are annotated in Figure 15.
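To make this analysis pipeline concrete, the following is a minimal sketch of such an omnibus goodness-of-fit test with post-hoc pairwise binomial tests and Holm-Bonferroni correction, using SciPy and statsmodels. The option labels and counts are hypothetical placeholders rather than the study data, and the paper's exact pairwise comparison scheme may differ.

```python
# A minimal sketch: omnibus chi-square goodness-of-fit test followed by
# post-hoc pairwise binomial tests with Holm-Bonferroni correction.
# The option labels and counts are hypothetical placeholders, not study data.
from collections import Counter
from itertools import combinations

from scipy.stats import chisquare, binomtest
from statsmodels.stats.multitest import multipletests

counts = Counter({
    "After its head stopped moving": 35,
    "When it was grasping": 20,
    "I never knew": 40,
})

# Omnibus test: are the response options chosen equally often?
chi2_stat, chi2_p = chisquare(list(counts.values()))
print(f"chi-square = {chi2_stat:.2f}, p = {chi2_p:.4f}")

# Post-hoc pairwise binomial tests: for each pair of options, test whether one
# option was chosen more often than the other (chance rate 0.5), then correct
# the p-values across all pairs with the Holm-Bonferroni procedure.
pairs = list(combinations(counts, 2))
raw_p = [binomtest(counts[a], counts[a] + counts[b], p=0.5).pvalue for a, b in pairs]
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p_adj, significant in zip(pairs, adjusted_p, reject):
    print(f"{a} vs. {b}: adjusted p = {p_adj:.4f}, significant = {significant}")
```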

Comparing the Project condition with the replay conditions, approximately the same number of participants (∼35, 36.8%) chose “After its head stopped moving” (in green) in the Project condition as chose “When its hand was very close to the table before grasping” (in violet) in the replay conditions.

While this means the Project condition accelerates picking inference, 40 participants (42.1%, the most of any condition) reported that they never knew the answer in the Project condition. In general, this happens in all non-replay conditions, as seen in the red bars of the bottom three subgraphs in Figure 15: 21 participants (22.1%) in Say, 40 (42.1%) in Project, and 27 (28.4%) in Project-Say selected “I never knew”. All of these results are statistically significant.

We also analyzed the conditions where the projection indicators are accompanied by replay and verbal indicators.

Adding replay (Replay-Project) or both replay and verbal indications (Replay-Project-Say) also accelerates picking inference. As seen from the purple and more saturated blue bars in the third and fourth replay subfigures (Replay-Project and Replay-Project-Say) in Figure 15, the number of participants who inferred “when it was grasping” decreased compared to the Replay and Replay-Say conditions that do not have any projection indicators (top two subfigures in Figure 15), and the earlier event “When its hand was over the table” has more participants than in the Replay and Replay-Say conditions. The differences for both events are statistically significant.

However, when verbal indicators were added to projection (Project-Say), or in the Say condition, fewer participants chose “I never knew”. Instead, 32.6% (31) more participants in the Say condition (66.3% vs. 33.7%, 63 vs. 32) and 17.9% (17) more participants in the Project-Say condition (51.6% vs. 33.7%, 49 vs. 32) inferred the picking location at the “After its head stopped moving” event than in the Project condition (see the green bars in Figure 15).

Thus, we can conclude that adding projection markers to physical replay makes participants’ picking inference faster, but not when they are combined only with verbal indicators. In terms of the maximum number of participants who inferred early, the verbal condition performs the best, but many participants (21, 22.1%) reported that they never knew where the robot picked. To ensure that almost all participants infer the picking location, projection with replay (Replay-Project) and projection with both replay and verbal indicators (Replay-Project-Say) are two better choices that perform approximately the same.


Figure 16. Participants’ responses to when they inferred where the ground obstacle had been, in the replay conditions. In summary, Replay-Project-Say performed the best, with 20+ participants inferring at the earliest event: before the robot started moving. (Non-replay conditions, excluding the Project condition where projection was always on, had their own answer options as the robot’s base did not move in these conditions; see Figures 17 and 18.)

5.3.2 Navigation Inference Timing

For navigation inference timing, we conducted the same type of chi-square goodness-of-fit test on the responses to the navigation timing questions in all conditions except for the Project condition, where timing is irrelevant because the robot projects throughout the video. Post-hoc binomial tests with Holm-Bonferroni correction for pairwise comparisons were conducted whenever the goodness-of-fit test revealed a statistically significant difference.

For replay conditions where the robot’s base moved, the chi-square goodness-of-fit test shows a statistically significant difference across responses (𝑝 < 0.0001). Pairwise comparison results are annotated on Figure 16.

As seen from the figure, around half of the participants who experienced the Replay and Replay-Say conditions (top two subfigures), 47 (49.5%) in Replay and 55 (57.9%) in Replay-Say, were able to infer the ground obstacle location while the robot was in grid E (fourth cyan bars in Figure 16).

When projection indicators are added (last two subfigures in Figure 16), these numbers dropped by half to 28 (29.5%) in both the Replay-Project and Replay-Project-Say conditions. Instead, 26 (27.4%) participants, approximately the same as the number that dropped, inferred the ground obstacle location at the earlier event of “before the robot started moving” in the Replay-Project-Say condition (the first bar in the last subfigure of Figure 16). For the Replay-Project condition, the dropped participants are distributed across two earlier events, “when the robot was facing area A” and “While the robot was moving towards the caddy table in grid F”, and a later event, “while the robot was moving towards the caddy table in grid C”, but post-hoc binomial tests suggest that the responses for the first and the third may have occurred by chance.


Figure 17. Participants’ responses in the Say condition to when they inferred where the ground obstacle had been. Most participants made the inference after the robot started speaking, more than for any option in the replay conditions shown in the previous figure (Figure 16). (These options were only present in the Say condition, as the robot did not move its base but only spoke.)

In the Say condition, binomial tests with Holm-Bonferroni correction suggest a statistically significant difference across all options. Results are shown in Figure 17. 78.9% of participants (75) were able to make the ground obstacle inference “after it started speaking”. No participants reported that they made the inference “before it started speaking”.

This result is consistent with the previous finding that Replay-Project-Say accelerates participants’ inference, as it suggests that the verbal indicator is remarkably effective.


Figure 18. Participants’ responses to when they inferred where the ground obstacle had been. More than 60% of participants made the inference from the ground projection. (These options were only present in the Project-Say condition, as the robot did not move its base but projected onto the ground and spoke.)

In the Project-Say condition, we performed the same statistical test as in the Say condition; results are shown in Figure 18. Sixty participants (63.2%) were able to tell where the ground obstacle had been “before it started speaking (At the beginning of the video, with projection)”. While only 30 participants (31.6%) reported they knew it “after it started speaking”, the binomial tests with Holm-Bonferroni correction, unfortunately, suggest that this result might have occurred by chance. Nonetheless, comparing responses in the Say condition to those in the Project-Say condition, more than 60% of participants could make the inference from the projection alone, before the robot spoke.

Thus, we reach a mixed conclusion again, similar to the one for picking inference timing: participants responded differently to projection markers added to other cues versus projection markers alone. Including projection indications in a replay condition, with or without verbal indications, shifts the inference from the middle of the path (“While the robot was moving towards the caddy table in grid E”) to leaving for the caddy (“While the robot was moving towards the caddy table in grid F”) and arriving at the caddy table (“While the robot was moving towards the caddy table in grid C”). With projection indications only, more than 60% of the participants were able to make the inference from the projection before the robot started speaking, as suggested by the responses in the Project-Say condition.


Figure 19. Participants’ responses to when they inferred which section of the caddy the object was placed into. Participants in the Project condition made the earliest inference, after the robot’s head stopped moving; however, 30+ participants reported they never knew. Say and Project-Say have the top performance because participants elaborated on what they knew from the robot’s speech. For replay conditions, arm movement significantly delays the inference.

5.3.3 Placement Inference Timing

Finally, we analyzed the responses to the placement inference timing questions. As with the other analyses in this section, we ran a chi-square goodness-of-fit test, which revealed a statistically significant difference (𝑝 < 0.0001). Results from post-hoc binomial tests with Holm-Bonferroni correction for pairwise comparisons are shown in Figure 19.

Again, the Project condition has the most participants (50, 52.6%) who were able to infer the placement position “After its head stopped moving” (the green bar in the second-last subfigure of Figure 19), while only 36 and 40 participants (37.9% and 42.1%) in the Say and Project-Say conditions were able to do so at the same event. For replay conditions, fewer than eight participants were able to make the inference at any of the statistically significant events that happened before the robot’s hand was over the caddy.

However, similar to the responses to the picking inference question, the Project condition had the most participants (31, 32.6%) who chose “I never knew” (pink bar). Only a few participants chose this option in the Replay, Replay-Project, and Replay-Project-Say conditions. The responses to this option in the other conditions, i.e., Replay-Say, Say, and Project-Say, are not statistically significant and may have occurred by chance.

For replay conditions, most responses are distributed across three late events when the robot’s gripper is above the caddy, getting closer and releasing the object: “When its hand was over the caddy”, “When its hand was very close to the caddy before releasing the gearbox bottom”, and “When it was releasing the gearbox bottom”.

It is worth mentioning that for the Say and Project-Say conditions, which lack any arm replay, 44 and 38 participants (46.3% and 40.0%) chose to elaborate on a different choice. Upon analyzing these responses, all of these participants said they knew when the robot said so, which is the same event as “After its head stopped moving”. Therefore, Say and Project-Say are the fastest conditions for participants to infer the placement position. Compared to the replay conditions just discussed, the robot’s physical arm movement delayed the inference.

5.3.4 Conclusion

Thus, the efficiency aspect of projection indicators in H3 is partially supported. Projection indicators performed the best in helping participants infer the ground obstacle location, as early as the obstacle was indicated by projection. However, for inferring the picking and placing locations, while projection expedites the inference, it also prevents roughly one-third or more of the participants from making any inference – 40 (42.1%) for picking and 31 (32.6%) for placing – as seen from the top bar (“I never knew”) of the second-last row in Figures 15 and 19.

5.4 H3 – Accuracy. More accurate causal inference with projection markers (not supported)

To test whether projection markers make causal inference more accurate, we analyzed participants’ confidence ratings (ordinal Likert responses) in their answers to all three inference questions, using nonparametric tests. More accurate methods should make participants more confident.


Figure 20. Participants’ confidence levels in their responses to the picking inference question (Solid lines indicate median values). Generally, participants in replay conditions are more confident than those in non-replay conditions (Confident or very confident in replay conditions vs. somewhat confident in non-replay conditions).

5.4.1 Picking Inference Confidence

For participants’ confidence in their picking inference responses (see Figure 20), we first conducted a Kruskal-Wallis H test, which revealed a statistically significant difference across conditions (𝜒2(6) = 240.02, 𝑝 < 0.0001). We then ran post-hoc Mann-Whitney U pairwise comparisons with Holm-Bonferroni correction. Results show significant differences between the Replay condition and the conditions of Say, Project, and Project-Say (all: 𝑝 < 0.0001), between Replay-Say and the conditions of Say, Project, and Project-Say (all: 𝑝 < 0.0001), between Replay-Project and the conditions of Say, Project, and Project-Say (all: 𝑝 < 0.0001), and between Replay-Project-Say and the conditions of Say, Project, and Project-Say (all: 𝑝 < 0.0001).

The results show that participants in replay conditions are more confident in their responses than those in non-replay conditions, i.e., very confident in replay conditions vs. somewhat confident in non-replay conditions.
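The same omnibus-plus-post-hoc procedure is used for the navigation and placement confidence analyses below. As an illustration, the following is a minimal sketch of a Kruskal-Wallis H test followed by pairwise Mann-Whitney U tests with Holm-Bonferroni correction, using SciPy and statsmodels; the ratings are hypothetical placeholders, not the study data.

```python
# A minimal sketch: omnibus Kruskal-Wallis H test followed by post-hoc
# pairwise Mann-Whitney U tests with Holm-Bonferroni correction.
# The confidence ratings are hypothetical placeholders, not study data.
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

ratings = {
    "Replay": [5, 5, 4, 5, 4, 5],
    "Say": [3, 2, 3, 4, 3, 2],
    "Project": [3, 3, 2, 3, 4, 2],
}

# Omnibus test across all conditions.
h_stat, h_p = kruskal(*ratings.values())
print(f"H = {h_stat:.2f}, p = {h_p:.4f}")

# Post-hoc pairwise comparisons with Holm-Bonferroni correction.
pairs = list(combinations(ratings, 2))
raw_p = [mannwhitneyu(ratings[a], ratings[b], alternative="two-sided").pvalue
         for a, b in pairs]
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p_adj, significant in zip(pairs, adjusted_p, reject):
    print(f"{a} vs. {b}: adjusted p = {p_adj:.4f}, significant = {significant}")
```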


Figure 21. Participants’ confidence levels in their responses to the navigation inference question (solid lines indicate median values). Participants’ confidence levels in the Project and Project-Say conditions increased to confident, from somewhat confident in their picking inference. However, the statistically significant differences suggest that participants are still more confident in replay conditions (more right-skewed).

5.4.2 Navigation Inference Confidence

For participants’ confidence in their navigation inference responses (Figure 21), a Kruskal-Wallis H test shows that there is a statistically significant difference across conditions (𝜒2(6) = 79.125,𝑝 < 0.0001). Post-hoc Mann-Whitney U pairwise comparisons with Holm-Bonferroni correction show the statistically significant results between the Replay condition and the conditions of Say (𝑝 < 0.0001), Project (𝑝 < 0.0001), and Project-Say (𝑝 < 0.01), between Replay-Say and the conditions of Say (𝑝 < 0.0001), Project (𝑝 < 0.001), and Project-Say (𝑝 < 0.05), between Replay-Project and the conditions of Say (𝑝 < 0.0001), Project (𝑝 < 0.001), and Project-Say (𝑝 < 0.05), between Replay-Project-Say and the conditions of Say (𝑝 < 0.0001), Project (𝑝 < 0.001), and Project-Say (𝑝 < 0.05), and between Say and Project-Say (𝑝 < 0.01).

These results are statistically consistent with the picking inference confidence data, with an additional statistically significant difference between Say and Project-Say. It is worth noting that participants’ confidence rose to a median of confident in Project and Project-Say, from somewhat confident in the picking inference (compare the last two subfigures in Figure 21 with the last two in Figure 20). Although the median value increased, the statistically significant results suggest that participants in replay conditions are still more confident, as seen from the more right-skewed distributions in the left four subfigures of Figure 21.


Figure 22. Participants’ confidence levels in their responses to the placement inference question (solid lines indicate median values). Participants in the Project condition have more unsure ratings, while the other conditions have more participants distributed across the confidence-level ratings: somewhat confident (1), confident (2), and very confident (3).

5.4.3 Placement Inference Confidence

For participants’ confidence in their placement inference responses (Figure 22), unsurprisingly, a Kruskal-Wallis H test shows a statistically significant difference across conditions (𝜒2(6) = 53.436, 𝑝 < 0.0001). Post-hoc Mann-Whitney U pairwise comparisons with Holm-Bonferroni correction show statistically significant results between the Replay condition and the conditions of Replay-Project-Say (𝑝 < 0.05) and Project (𝑝 < 0.01), between Replay-Say and the conditions of Replay-Project (𝑝 < 0.05) and Project (𝑝 < 0.0001), between Replay-Project and the conditions of Replay-Project-Say (𝑝 < 0.01) and Project (𝑝 < 0.05), between Replay-Project-Say and the conditions of Say (𝑝 < 0.05) and Project (𝑝 < 0.0001), and between Say and Project (𝑝 < 0.05). For the Project-Say condition, no statistically significant results were found.

These results suggest that participants who experienced the Project and Replay-Project projection methods are less confident than those in the non-projection replay conditions, i.e., somewhat confident vs. confident. It is worth noting that after adding verbal indicators to Replay-Project (Replay-Project-Say), participants become more confident. The Say condition itself and Project-Say also perform as well as the first three replay conditions, as there is no statistically significant difference between any of those pairs.

5.4.4 Conclusion

Thus, the accuracy aspect of H3 is not supported, as measured through confidence. Projection alone is not more accurate than the replay conditions, because participants are less confident in their inferences. Replay conditions perform well in almost all subtasks, while verbal indicators make participants as confident as replay conditions for placement inference.

5.5 H4. The same workload in both verbal and projection conditions (mostly supported)

To investigate whether participants bear the same workload in both verbal and projection conditions, we analyzed the Likert responses to the NASA Task Load Index questionnaire. Figure 23 shows a bar chart that visualizes the data.


Figure 23. Responses to the NASA Task Load Index questionnaire. Results from pairwise comparisons are shown and dashed lines indicate median values. In general, replay conditions and the Say condition performed the best; No statistically significant differences were found between the Say and Project conditions except for performance. See Section 5.5 for more details.

We ran Kruskal-Wallis H tests across the subscales, which revealed statistically significant differences for all of them: mental demand (𝜒2(6) = 42.3, 𝑝 < 0.0001), temporal demand (𝜒2(6) = 15.4, 𝑝 < 0.05), performance (𝜒2(6) = 176, 𝑝 < 0.0001), effort (𝜒2(6) = 31.1, 𝑝 < 0.0001), and frustration level (𝜒2(6) = 93.5, 𝑝 < 0.0001). We then ran post-hoc Mann-Whitney U pairwise comparisons with Holm-Bonferroni correction; the results are shown in Figure 23.
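The subscale results below are reported as medians over numerically coded Likert labels (e.g., somewhat low coded as -1, somewhat high coded as 1). The following is a minimal sketch of that coding step with pandas; the labels, codes, and data are illustrative placeholders rather than the questionnaire's exact wording or the study data.

```python
# A minimal sketch: code bipolar Likert labels to integers and report the
# per-condition median, analogous to the median values shown with Figure 23.
# Labels, codes, and data are placeholders, not the questionnaire or study data.
import pandas as pd

codes = {"very low": -3, "low": -2, "somewhat low": -1, "neutral": 0,
         "somewhat high": 1, "high": 2, "very high": 3}

raw = pd.DataFrame({
    "condition": ["Replay", "Replay", "Project", "Project"],
    "mental_demand": ["low", "somewhat low", "somewhat high", "high"],
})

# Map labels to integer codes and take the median per condition.
raw["mental_demand_code"] = raw["mental_demand"].map(codes)
print(raw.groupby("condition")["mental_demand_code"].median())
```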

Surprisingly, projection indicators alone had a median rating of somewhat high (coded as 1) for mental demand. Statistically significant differences were found between the Project condition and each of the replay conditions (see column 1, row 6 in Figure 23). No significant differences were found between the Project condition and the Say and Project-Say conditions.

For temporal demand, every condition was rated low to some extent: either somewhat low (coded as -1, in Project and Project-Say) or low (coded as -2, in all replay conditions and Say). No statistically significant differences were found between the Project condition and any other condition except Replay-Project-Say.

In terms of participants’ performance, the Project condition is the only one not rated good to some extent – its median rating is neither good nor bad. Statistically significant differences were found between the Project condition and all other conditions, in which all replay conditions were rated as good (coded as 2) and the Say and Project-Say conditions as somewhat good (coded as 1).

In the effort ratings, participants rated projection indicators alone (the Project condition) as somewhat high (coded as 1). Statistically significant results were found between Project and each replay condition.

For frustration levels, statistically significant differences were found between each replay and non-replay condition pair. Participants rated replay conditions as low frustration, Say as somewhat low, and Project as well as Project-Say neither low nor high.

In summary, replay conditions and the Say condition generally performed the best.

Thus, H4 is mostly supported. Participants experienced the same workload in the Say and Project conditions for four subscales: mental demand (neutral to somewhat high, 𝑛.𝑠.), temporal demand (low to somewhat low, 𝑛.𝑠.), effort (neutral to somewhat high, 𝑛.𝑠.), and frustration level (somewhat low to neutral). There is only one slightly statistically significant difference (𝑝 < 0.05) between the Say and Project conditions: the performance subscale, somewhat good in Say vs. neither good nor bad in Project.

5.6 H5. A robot is more trustworthy with projection markers (not supported)


Figure 24. Responses to the Muir trust questionnaire [64]. Regarding predictability, participants rated non-replay conditions as less predictable. For reliability, projection markers improve reliability when accompanied by replay or by both replay and verbal indicators. In terms of competence, adding replay to either Say or Project or both increases the competence rating. For the direct trust measure, replay conditions have more positive ratings than their non-replay counterparts. See Section 5.6 for more details.

We analyzed the Likert responses to the four subscales used to measure trust: predictability, reliability, and competence, in addition to trust itself. The ratings are visualized in Figure 24.

As with the NASA Task Load Index questionnaire, we ran Kruskal-Wallis H tests across all subscales, and statistically significant results were revealed for all of them: predictability (𝜒2(6) = 24.8, 𝑝 < 0.001), reliability (𝜒2(6) = 23.9, 𝑝 < 0.001), competence (𝜒2(6) = 49.8, 𝑝 < 0.0001), and trust (𝜒2(6) = 30.7, 𝑝 < 0.0001). We then ran post-hoc Mann-Whitney U pairwise comparisons with Holm-Bonferroni correction. The results are also annotated in Figure 24.

In terms of predictability, projection and verbal markers only make the robot more predictable when accompanied by replay or by both replay and verbal indicators. We found four pairs of slightly statistically significant differences, as seen in the first column of Figure 24, between {Replay-Project, Replay-Project-Say} and {Say, Project}.

When combining the positive ratings (somewhat predictable, predictable, and very predictable; 1, 2, & 3), approximately 25% fewer participants rated the robot positively with verbal indications alone (Say) or projection indications alone (Project) (50 and 51 participants – 52.6% and 53.7%) than in the Replay-Project and Replay-Project-Say conditions (75 and 74 participants – 78.9% and 77.9%), where projection or both projection and verbal indications are combined with replay. This suggests that combining these indicators with replay led to the differences. Correspondingly, the number of participants who rated the robot as unpredictable dropped by roughly half, from 34 and 32 (35.8% and 33.7%) in the Say and Project conditions to 16 and 15 (16.8% and 15.8%) in the Replay-Project and Replay-Project-Say conditions.

In the reliability ratings, we reach a similar conclusion as for predictability: projection markers, but not verbal markers, only make the robot more reliable when accompanied by replay or by both replay and verbal indicators. Specifically, two slightly statistically significant differences were found, between Replay-Project and Project and between Replay-Project-Say and Project. Fewer participants reported reliability positively in the Project condition: 83 participants (87.4%) in both the Replay-Project and Replay-Project-Say conditions gave positive reliability ratings (rows 3 and 4 in column 2 of Figure 24), while 22.1% and 28.5% fewer participants – only 62 and 56 participants (65.3% and 58.9%) – agreed that they could count on the robot to do the job in the Project condition (second-last row in column 2).

For competence (third column in Figure 24), statistically significant differences were found between {Replay-Say, Replay-Project, Replay-Project-Say} and {Say, Project}, which indicates that adding physical replay to either Say or Project or both increases the robot’s perceived competence from neutral to somewhat high. In addition, statistically significant differences also exist between Project-Say and Replay-Project or Replay-Project-Say, all of which include projection indicators.

Comparing Replay-Project with Project-Say, in which physical replays replaced verbal indicators, 25 (26.3%) more participants positively reported the robot’s competence (74 in Replay-Project vs. 49 in Project-Say, 77.9% vs. 51.6%). As seen in Figure 24, this resulted in a more right-skewed distribution for competence in the Replay-Project condition (column 3, row 3) than the one for the Project-Say competence responses (column 3, row 7).

Comparing Replay-Project-Say with Project-Say (column 3, row 4 vs. row 7 in Figure 24), in which physical replay is added to projection and verbal indicators, we see almost the same effect (only two participants fewer) that 23 (24%) more participants positively rated the robot’s competence (72 in Replay-Project-Say vs. 49 in Project-Say, 75.8% vs. 51.6%).

For the more direct trust ratings, statistically significant differences were found between {Replay-Project, Replay-Project-Say} and non-replay conditions {Say, Project, Project-Say}. More participants reported “trust” (coded as 2) in Replay-Project (36 participants, 37.9%) and Replay-Project-Say (31, 32.6%) than non-replay conditions (15 – 15.8% in Say, 19 – 20% in Project, and 19 – 20% participants in Project-Say).

Thus, H5 is not supported. For the mobile manipulation task in the Project condition, without additional replay or verbal indicators or both, participants rated the robot as less trustworthy, less predictable, less reliable (to a lesser extent than predictability), and less competent to do the job.

5.7 H6. Less workload when presented both verbal and projection markers (almost not supported)

This last hypothesis can be tested in the same way as H4, using Figure 23. For the responses in the Project-Say condition, where both verbal and projection markers are presented, only the reported performance is better than in the Project condition: participants reported that their performance was somewhat good, versus neither good nor bad in the Project condition.

For the other statistically significant differences, Project-Say has higher mental demand (neutral) than the Replay and Replay-Project-Say conditions (both rated somewhat low) and a higher frustration level (neutral) than all replay conditions (all rated low).

In the temporal demand and effort ratings, no significant differences were found.

Thus, H6 is almost not supported. In only a single pairwise comparison, between Project-Say and Project on the performance subscale, does Project-Say have a better median rating (somewhat good vs. neither good nor bad).

6 Discussion

Surprisingly, all of our hypotheses about the use of verbal and projection markers alone were either partially supported (H1, H2, H3 – Efficiency, H4, and H6) or not supported at all (H3 – Accuracy and H5), although we believed that instant projection directly onto the operating environment would be more effective and efficient in terms of causal inference as well as mental workload.

In contrast, a combination of physical replay with verbal and/or projection markers has shown better effectiveness and efficiency as well as lower workload and increased trust. This is consistent with the findings of our previous study [42] that physical arm movement should be accompanied by verbal explanations.

In this section, we first discuss the results for each metric and end with overall recommendations (Section 6.6) that will help researchers and practitioners utilize these results.

6.1 Effectiveness of inferring missing causal information from the past

With verbal indicators (Section 5.1), participants were able to effectively infer where the robot placed the gearbox bottom, at the same level as the other conditions except for Project, as shown in Figure 14. However, the verbal indicators did not allow participants to infer where the robot picked the misrecognized object or where the ground obstacle had been during navigation. In both the picking and navigation inferences, half of the participants chose a place near the actual answer, as shown in Figures 12 and 13. Thus, verbal indicators using relative directional words should not be used alone for picking and navigation inference. Rather, arm movements with the grasping behavior should be used for picking, as almost all of the participants in all replay conditions made the correct inference, as shown in Figure 12. Physical replay and projection, instead of speech alone, should be included for better effectiveness in navigation tasks, because almost all of the participants in all conditions except the Say condition correctly inferred where the ground obstacle had been, as shown in Figure 13.

Projection is remarkably good for navigation inferences (Section 5.2), on par with the replay conditions, but only half of the participants were correct for picking inferences (see the bottom two rows in Figure 12) and no statistical significance was found for placement inferences (see the second-last row in Figure 14). As a preview before looking at when the inference happens in Section 6.2, projection markers are faster if a person is looking for causal inference during a robot’s navigation task (green bar in Figure 18), because physical movements of the robot’s base take time to replay (Figure 16). For picking and placing inferences, projection and verbal markers should be presented with arm movement (i.e., Replay-Project-Say) to accelerate the causal inference process (compare the last row to the other rows in Figure 16). Otherwise, these two indicators alone deteriorate people’s causal inference performance, as shown in the last three rows of Figure 12 and the second-last row of Figure 14.

As shown in Figure 14, placement inference effectiveness is different: for the other inference types (picking and navigation), almost all participants in at least one condition were able to correctly infer the causal information (Figures 12 and 13), but only around 60 (63.2%) of the 95 participants did so for placement inference in every condition except the Project condition, for which all responses were not significant.

The difficulty with placement inference might be that the gripper did not move into the compartment where the object was dropped. Not moving inside the square caddy compartment, compared with the other manipulation subtask – picking, where the robot moved to the target object to be grasped – made it more difficult to infer which caddy section the gripper was above. Even with projection into the caddy section, we might see better results if participants were next to the robot, in person, to see the projection inside the caddy, rather than viewing the projection in a recording from a fixed angle. The recommendation here is still a multimodal approach with physical replay, verbal, and projection indications to inform people of where an object was placed. Yet, we see that additional information from extra physical arm movement might be useful: for example, instead of an exact replay, an enhanced replay might be better, where the robot would move its gripper into the caddy section into which the object had been dropped, since it is harder for people to make the spatial inference over a set of concave compartments. Future work could investigate cases of placement into concave objects, with the robot using additional physical indicators to improve inference-making of where the object was placed.

6.2 Efficiency: How fast to infer past missing causal information

As mentioned in Section 5.3, we used discrete timing events to measure efficiency, independently of effectiveness, so one can decide which causal indicators to use by analyzing both.

As shown in the top four rows of Figure 15, most participants reported that they inferred the picking location during the last three timing events, after the robot’s arm moved close to the object. Participants in non-replay conditions, shown in the last three rows of Figure 15, made early inferences, reporting that they knew after the robot’s head stopped moving, when the head pointed to the projection, when the robot started speaking, or both. However, 31 (32.6%) of the 95 participants in the Project condition reported they never knew, and 11 (11.6%) participants never knew in each of the other two non-replay conditions, Say and Project-Say, although no statistical significance was found, as shown by the top bar in the last three rows of Figure 19. This suggests that the eye-gaze cue and the verbal/projection indicators together are not enough for picking inference; in practice, replay should be included alongside these cues.

From the analysis of participants’ responses to the navigation timing questions, which had three different sets of response options (Figures 16 to 18), two-thirds of the participants were able to make the inference right after seeing the projection (the other third’s responses had no statistical significance; see Figure 18). Verbal indicators alone are the best overall if we only consider statistically significant results (non-significant results were also observed among the responses, as shown in Figure 17).

For the replay conditions, we saw interesting reactions from participants. While the performance in the Replay-Say condition is on par with Replay (top two rows in Figure 16), adding projection indicators delays the inference (the third row in Figure 16). However, with both verbal and projection indicators (Replay-Project-Say; the fourth row in Figure 16), 26 (27.4%) of the 95 participants reported that they knew before the robot started moving, while fewer than 10 participants indicated this when only one indicator type was used (as shown by the first bar of the middle two rows in Figure 16).

Therefore, our recommendation for making the navigation inference faster is to use projection only, as shown in Figure 18. This is consistent with previous research on navigation path visualization by other researchers (e.g., [19, 23]). If projection is not available, we recommend using verbal indicators without any replay, as shown in Figure 17, rather than simply replaying the base movement.

For placing, we see the same pattern as picking. As shown in Figure 19, participants’ inference timing in the replay conditions (top four rows) is distributed across three later events, after the robot’s gripper is close to the caddy. In the Project condition, 31 (32.6%) participants never knew which caddy section the object was placed into, while 50 (52.6%) participants reported they knew after the robot’s head stopped moving. The Say and Project-Say conditions perform the best: after analyzing responses to the “other” option, ∼80 participants in both conditions were able to infer after its head stopped moving. So, in practice, to accelerate people’s placement inference, verbal indicators should be used.

6.3 Inference accuracy and confidence of past missing causal information

Regarding the accuracy aspect of inference (Section 5.4), the responses show a simpler pattern. In general, replay conditions make participants more confident in their picking and navigation inference choices, as shown in Figures 20 and 21. For placement, people are more confident when both physical replay and verbal indicators are present, as shown in Figure 22. Thus, we recommend physical replay to increase confidence levels for picking and navigation inferences. For placement, we suggest including verbal indicators with physical replay.

6.4 Mental workload for past causal information inference

In terms of mental workload (Sections 5.5 and 5.7), replay conditions generally performed the best. As shown in Figure 23, the Project condition is generally a few Likert levels worse, and the Say condition is roughly on par with the replay conditions. The reason for the Project case might be that, even though the projection is directly on the operating environment, its meaning is rather implicit even though the projection itself is explicit. So, in practice, projection indicators should be combined with physical replay to achieve the other benefits discussed throughout this section.

6.5 Perceived trust as a result of causal inference indications

As shown in Figure 24 of Section 5.6, conditions with multiple cues have more positive ratings on the trust subscales of predictability, reliability, competence, and trust. In summary, replay conditions perform better than non-replay conditions, and Replay-Project and Replay-Project-Say are the best two, as statistically significant results were always found when comparing them with the non-replay conditions. So, we recommend the Replay-Project and Replay-Project-Say conditions to achieve slightly better results.

6.6 Summary of Findings

Table 3. A comparison of the seven communication modalities across task-based dependent measures

Replay conditions (with physical replay)
Measure | Task | Replay | Replay-Say | Replay-Project | Replay-Project-Say
Inference effectiveness (§5.1 & §5.2) | Pick. (§5.1.1) | 4th most effective | 1st most effective | 2nd most effective^a | 2nd most effective^a
 | Nav. (§5.1.2) | 4th most effective^a | 6th most effective | 2nd most effective | 3rd most effective
 | Place. (§5.1.3) | 4th most effective | 3rd most effective | 1st most effective | 5th most effective
Inference efficiency (§5.3) | Pick. (§5.3.1) | 6th fastest^a | 6th fastest^a | 4th fastest^a | 4th fastest^a
 | Nav. (§5.3.2) | 7th fastest | 6th fastest | 4th fastest | 5th fastest
 | Place. (§5.3.3) | 6th fastest | 4th fastest | 7th fastest | 5th fastest
Inference confidence (§5.4) | Pick. (§5.4.1) | 1st most confident^b | 1st most confident^b | 1st most confident^b | 1st most confident^b
 | Nav. (§5.4.2) | 1st most confident^b | 1st most confident^b | 1st most confident^b | 1st most confident^b
 | Place. (§5.4.3) | 3rd most confident | 1st most confident | 3rd most confident | 1st most confident

Non-replay conditions (without physical replay)
Measure | Task | Say | Project | Project-Say
Inference effectiveness (§5.1 & §5.2) | Pick. (§5.1.1) | 7th most effective | 6th most effective | 5th most effective
 | Nav. (§5.1.2) | 7th most effective | 1st most effective | 4th most effective^a
 | Place. (§5.1.3) | 6th most effective | 6th most effective | 2nd most effective
Inference efficiency (§5.3) | Pick. (§5.3.1) | 1st fastest | 3rd fastest | 2nd fastest
 | Nav. (§5.3.2) | 3rd fastest | 1st fastest^a | 1st fastest^a
 | Place. (§5.3.3) | 2nd fastest | 1st fastest | 2nd fastest
Inference confidence (§5.4) | Pick. (§5.4.1) | 5th most confident^c | 5th most confident^c | 5th most confident^c
 | Nav. (§5.4.2) | 7th most confident^d | 6th most confident^d | 5th most confident^d
 | Place. (§5.4.3) | 3rd most confident | 6th most confident | Any (n.s.)^e

^a Equal or almost equal percentage.
^b No statistical significance found with replay conditions, but statistical significance found with non-replay conditions.
^c No statistical significance found with non-replay conditions, but statistical significance found with replay conditions.
^d Statistical significance found with replay conditions. For non-replay conditions, significant difference was found between Say and Project-Say.
^e No statistical significance was found with any other condition.

Table 4. A comparison of the seven communication modalities across team-based dependent measures

Replay conditions (with physical replay)
Measure | Replay | Replay-Say | Replay-Project | Replay-Project-Say
Workload (§5.5) | 1st least demanding | 3rd least demanding | 3rd least demanding | 1st least demanding
 | 1st lowest time pressure | 1st lowest time pressure | 1st lowest time pressure | 1st lowest time pressure
 | 1st highest performance | 1st highest performance | 1st highest performance | 1st highest performance
 | 1st lowest effort | 1st lowest effort | 1st lowest effort | 1st lowest effort
 | 1st least frustrated | 1st least frustrated | 1st least frustrated | 1st least frustrated
Trust (§5.6) | 1st most predictable | 1st most predictable | 1st most predictable | 1st most predictable
 | 1st most reliable | 1st most reliable | 1st most reliable | 1st most reliable
 | 1st highly competent | 1st highly competent | 1st highly competent | 1st highly competent
 | 1st most trustworthy | 1st most trustworthy | 1st most trustworthy | 1st most trustworthy

Non-replay conditions (without physical replay)
Measure | Say | Project | Project-Say
Workload (§5.5) | 3rd least demanding | 7th least demanding | 3rd least demanding
 | 1st lowest time pressure | 6th lowest time pressure | 6th lowest time pressure
 | 5th highest performance | 7th highest performance | 5th highest performance
 | 1st lowest effort | 7th lowest effort | 1st lowest effort
 | 5th least frustrated | 6th least frustrated | 6th least frustrated
Trust (§5.6) | 1st most predictable | 1st most predictable | 1st most predictable
 | 1st most reliable | 1st most reliable | 1st most reliable
 | 6th highly competent | 6th highly competent | 1st highly competent
 | 1st most trustworthy | 1st most trustworthy | 1st most trustworthy
Median ratings are shown.

To ease comprehension and reinforce our recommendations, we created five tables to help utilize the findings. Tables 3 and 4 present each condition’s relative rank across task-based and team-based dependent measures, respectively. The task-based measures (Table 3) comprise inference effectiveness, inference efficiency, and inference confidence. The team-based measures (Table 4) comprise workload and trust.

Table 5. Rankings of the seven communication modalities across task-based dependent measures. Lower scores indicate more favorable performance and are in bold.

Measure | | Replay-Project-Say | Replay-Project | Project-Say | Replay-Say | Project | Replay | Say
Inference effectiveness (§5.1 & §5.2) | Picking effectiveness | 2 | 2 | 5 | 1 | 6 | 4 | 7
 | Navigation effectiveness | 3 | 2 | 4 | 6 | 1 | 4 | 7
 | Placement effectiveness | 5 | 1 | 2 | 3 | 6 | 4 | 6
Inference efficiency (§5.3) | Picking efficiency | 4 | 4 | 2 | 6 | 3 | 6 | 1
 | Navigation efficiency | 5 | 4 | 1 | 6 | 1 | 7 | 3
 | Placement efficiency | 5 | 7 | 2 | 4 | 1 | 6 | 2
Inference confidence (§5.4) | Picking confidence | 1 | 3 | 5 | 2 | 6 | 4 | 7
 | Navigation confidence | 2 | 3 | 4 | 5 | 6 | 1 | 7
 | Placement confidence | 1 | 3 | 5 | 2 | 7 | 4 | 6
Picking Subtotal | | 7 | 9 | 12 | 9 | 15 | 14 | 15
Navigation Subtotal | | 10 | 9 | 9 | 17 | 8 | 12 | 17
Placement Subtotal | | 11 | 11 | 9 | 9 | 14 | 14 | 14
Total | | 28 | 29 | 30 | 35 | 37 | 40 | 46
Rank | | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th
Median ratings are used for effectiveness and efficiency scores. However, mean values are used for the confidence Likert responses to break ties, as the median rankings can be seen in Table 8 and visually in Figures 20 to 22.

Table 6. Rankings of the seven communication modalities across team-based dependent measures. Lower scores indicate more favorable performance and are in bold.

Measure | | Replay-Project-Say | Replay-Project | Replay | Replay-Say | Say | Project-Say | Project
Workload (§5.5) | Mental demand | 2 | 3 | 1 | 4 | 5 | 6 | 7
 | Temporal demand | 5 | 4 | 2 | 1 | 3 | 6 | 7
 | Performance | 6 | 5 | 2 | 4 | 5 | 3 | 7
 | Effort | 1 | 4 | 2 | 3 | 5 | 6 | 7
 | Frustration level | 1 | 2 | 4 | 3 | 5 | 6 | 7
Trust (§5.6) | Predictability | 1 | 2 | 4 | 5 | 6 | 3 | 7
 | Reliability | 1 | 2 | 4 | 3 | 6 | 5 | 7
 | Competence | 1 | 2 | 4 | 3 | 6 | 5 | 7
 | Trust | 2 | 1 | 4 | 3 | 5 | 6 | 7
Workload Subtotal | | 15 | 18 | 11 | 15 | 23 | 27 | 35
Trust Subtotal | | 5 | 7 | 16 | 14 | 23 | 19 | 28
Total | | 20 | 25 | 27 | 32 | 46 | 46 | 63
Rank | | 1st | 2nd | 3rd | 4th | 5th | 5th | 7th
Mean values are used for these Likert responses to break ties, as the median rankings can be seen in Table 9 and Figures 23 and 24.

Table 7. Rankings of the seven communication modalities across all dependent measures. Lower scores indicate more favorable performance and are in bold.

Measure | Replay-Project-Say | Replay-Project | Replay | Replay-Say | Project-Say | Say | Project
Task-based subtotal | 28 | 29 | 35 | 35 | 30 | 46 | 37
Team-based subtotal | 20 | 25 | 27 | 32 | 46 | 46 | 63
Total | 48 | 54 | 62 | 67 | 76 | 92 | 100
Rank | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th

Tables 5 and 6 provide the sum of the ranks to give an overall score across, again, task-based and team-based outcomes. Summing both, Table 7 shows the overall score across all measures.
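For readers who want to reproduce this aggregation with their own measures, the following is a minimal Python sketch of the rank-sum scoring behind Tables 5 to 7; the per-measure ranks for three of the seven conditions are copied from Tables 5 and 6 for illustration, and other conditions or measures can be substituted.

```python
# A minimal sketch of the rank-sum aggregation behind Tables 5-7: sum the
# per-measure ranks within each category, add the category subtotals, and
# order conditions by the combined score (lower is better). For illustration,
# the ranks below are those listed in Tables 5 and 6 for three of the seven
# conditions (effectiveness, efficiency, and confidence; then workload and trust).
task_ranks = {
    "Replay-Project-Say": [2, 3, 5, 4, 5, 5, 1, 2, 1],  # sums to 28
    "Replay-Project":     [2, 2, 1, 4, 4, 7, 3, 3, 3],  # sums to 29
    "Say":                [7, 7, 6, 1, 3, 2, 7, 7, 6],  # sums to 46
}
team_ranks = {
    "Replay-Project-Say": [2, 5, 6, 1, 1, 1, 1, 1, 2],  # sums to 20
    "Replay-Project":     [3, 4, 5, 4, 2, 2, 2, 2, 1],  # sums to 25
    "Say":                [5, 3, 5, 5, 5, 6, 6, 6, 5],  # sums to 46
}

# Combine the category subtotals and order conditions by the total.
totals = {c: sum(task_ranks[c]) + sum(team_ranks[c]) for c in task_ranks}
for rank, (condition, score) in enumerate(
        sorted(totals.items(), key=lambda item: item[1]), start=1):
    print(f"{rank}. {condition}: combined score {score}")
# Prints 48, 54, and 92, matching the Total row of Table 7 for these conditions.
```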

As seen in Table 3, Replay-Project-Say, Replay-Project, and Project-Say all ranked higher (i.e., had lower combined scores) on the task-based outcomes; we will refer to these as the top three conditions. In contrast, the remaining conditions, all with one modality except for Replay-Say, had lower ranks. At the subtask level, the top three conditions also ranked higher for the picking and placement inference tasks, both in manipulation. However, for navigation inferences, all conditions with projection ranked well. So, if one is looking for manipulation inferences, all of the top three conditions work equally well; one can implement either Replay-Project or Project-Say if one modality is not available, e.g., if speech might not be distinguishable in a noisy environment such as a warehouse. Regardless of which condition is chosen, projection indicators must be combined with another modality for manipulation inferences. For navigation, however, projection alone is sufficient.

From the team-based outcomes (Table 4), Replay-Project-Say ranked first with the lowest score, followed by the other three replay conditions and then the non-replay conditions. Although the multimodal Replay-Project-Say condition had a higher workload than Replay, we still recommend Replay-Project-Say because Replay had a higher rank sum for the trust measures.

In total (Table 7), the multimodal Replay-Project-Say condition ranked the highest. Removing speech, Replay-Project ranked second, with a drop in team-based scores. Category-wise, replay conditions occupied the top four ranks while non-replay conditions occupied the last three. So, overall, we recommend implementing all three communication modalities for robots to achieve the best outcomes in both inference tasks and team-based measures.

6.7 Overall Recommendations

As a key takeaway, we recommend combining physical replay with speech and projection indicators (Replay-Project-Say) to help infer all the missing causal information (picking, navigation, and placement) from the robot’s past actions. This condition had the best outcome in both task-based – effectiveness, efficiency, and confidence – and team-based metrics – workload and trust. If one’s focus is efficiency, we recommend projection markers for navigation inferences and verbal markers for placing inferences.

7 Limitations and Future Work

In the study, we placed the camcorder at the best viewing position and angle, from which a person could view the whole relevant scene and the robot. However, in reality, humans are not static like the camcorder: when not in a crowd, they may move around to better understand the event. We considered recording multiple videos, placing them in a grid, and presenting them to participants. However, we were concerned that this might distract their attention or increase workload, requiring them to re-focus as they moved from one sub-video to another to look at different angles. Ideally, we could simulate a human walking path by moving the camcorder; however, the walking path would likely differ from person to person because it is driven by the particular person’s thought process, so it becomes a research problem of its own. As future work, online studies with videos from multiple angles would be interesting to pursue, as would in-person studies.

In addition, we tried to discretize time by having a fixed set of timing events for picking, navigation, and placing during the mobile manipulation task. In the past [45], we have analyzed the timing by recording videos of participants and extracting the timing information continuously frame by frame. In the future, one may do the same in order to get more accurate data directly from the continuous variable of time, instead of from a predefined set of discrete events.

To measure which method is more accurate, we measured perceived accuracy using a subjective confidence metric. However, being more confident in inference-making may not be the same as the communication method being accurate; likewise, these are not necessarily causally related. As more research compares different communication methods, future researchers can explore different metrics or synthesize this body of comparison work.

Like trust, causal inference-making is a process rather than a single-time-point event. People may start inferring the past missing causal information at an earlier event while still unsure, and then become more confident as the robot finishes the task. While we attempted to capture this by presenting only part of the task sequence, e.g., only having projection in the Project condition, a better method would be implementing a think-aloud protocol or asking participants to complete the survey during the process to capture the change over time, just as Desai et al. did to study trust [29].

To investigate past behavior explanation, we assumed that the worker is familiar with the robot and was able to understand why the robot performed its actions after the replay, given the three scenario descriptions (see Section 1): it recognized a large wood chip as a large gearbox bottom, so it picked it up; it perceived a ground obstacle, so it detoured; and it may not have correctly recognized the concave caddy with its three sections. We argued that the robot in those cases may not know what it did wrong. Yet, this assumes that the robot cannot robustly assess its own behaviors. We are currently working with Gautam et al. [36] on implementing proficiency assumption checkers, and hopefully this collaborative effort will shed some light on a robot’s self-assessment, so that the robot is able to explain why it performed these actions.

As we drew inspiration from the human imitation literature, we can also learn from the teaching literature. It is well known that learners have preferences among communication channels, e.g., visual learners and auditory learners [37]. If robots could know this information beforehand, they could better accommodate individual differences to produce better inference outcomes. This is particularly important for personal robots. A study of this preference would greatly complement this work, which targeted the general population.

Last but not least, although we had generalizability in mind while designing both the manipulation and navigation tasks, and are confident the findings will replicate in other domains, as stated in Section 1.2, further studies will be needed to validate the findings in other HRI applications that have their own unique contexts.

8 Conclusion

In this work, we have investigated how a robot could communicate past causal information in an integrated mobile manipulation scenario encompassing three distinct robot tasks, i.e., picking, navigation, and placement, to better understand whether the results may or may not generalize across tasks. Physical replay with head, arm, and base movement was implemented along with verbal and projection indicators. We analyzed the results in a hypothesis-oriented manner and drew conclusions for the three different tasks.

Utilizing the three communication modalities, participants were able to infer the missing causal information. In summary, the results suggest a multimodal approach: combining physical replay with verbal and projection indicators performs the best in helping participants infer where the robot picked an object, where the ground obstacle had been, and where the robot placed an object in a caddy. We did find that the projection and verbal indicators perform differently across tasks: projection markers alone are remarkably efficient in helping people make navigation inferences, while verbal indicators are exceptionally efficient for making placing inferences.

Appendix A

Table 8. Median Rankings of the seven communication modalities across task-based dependent measures. Lower scores indicate more favorable performance and are in bold.

Measure | | Replay-Project | Replay-Project-Say | Replay-Say | Project-Say | Project | Replay | Say
Inference effectiveness (§5.1 & §5.2) | Picking effectiveness | 2 | 2 | 1 | 5 | 6 | 4 | 7
 | Navigation effectiveness | 2 | 3 | 6 | 4 | 1 | 4 | 7
 | Placement effectiveness | 1 | 5 | 3 | 2 | 6 | 4 | 6
Inference efficiency (§5.3) | Picking efficiency | 4 | 4 | 6 | 2 | 3 | 6 | 1
 | Navigation efficiency | 4 | 5 | 6 | 1 | 1 | 7 | 3
 | Placement efficiency | 7 | 5 | 4 | 2 | 1 | 6 | 2
Inference confidence (§5.4) | Picking confidence | 1 | 1 | 1 | 5 | 5 | 1 | 5
 | Navigation confidence | 1 | 1 | 1 | 5 | 6 | 1 | 7
 | Placement confidence | 3 | 1 | 1 | 4 | 6 | 3 | 3
Picking Subtotal | | 7 | 7 | 8 | 12 | 14 | 11 | 13
Navigation Subtotal | | 7 | 9 | 13 | 10 | 8 | 12 | 17
Placement Subtotal | | 11 | 11 | 8 | 8 | 13 | 13 | 11
Total | | 25 | 27 | 29 | 30 | 35 | 36 | 41
Rank | | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th
The corresponding table with mean rankings, to break ties, is Table 5.

Table 9. Median rankings of the seven communication modalities across team-based dependent measures. Lower scores indicate more favorable performance and are in bold.

Measure | | Replay | Replay-Project-Say | Replay-Say | Replay-Project | Say | Project-Say | Project
Workload (§5.5) | Mental demand | 1 | 1 | 3 | 3 | 3 | 3 | 7
 | Temporal demand | 1 | 1 | 1 | 1 | 1 | 6 | 6
 | Performance | 1 | 1 | 1 | 1 | 5 | 5 | 7
 | Effort | 1 | 1 | 1 | 1 | 1 | 1 | 7
 | Frustration level | 1 | 1 | 1 | 1 | 5 | 6 | 6
Trust (§5.6) | Predictability | 1 | 1 | 1 | 1 | 1 | 1 | 1
 | Reliability | 1 | 1 | 1 | 1 | 1 | 1 | 1
 | Competence | 1 | 1 | 1 | 1 | 6 | 1 | 6
 | Trust | 1 | 1 | 1 | 1 | 1 | 1 | 1
Total | | 9 | 9 | 11 | 11 | 24 | 25 | 42
Rank | | 1st | 1st | 3rd | 3rd | 5th | 6th | 7th
The corresponding table with mean rankings, to break ties, is Table 6.

References

[1] Nichola Abdo, Cyrill Stachniss, Luciano Spinello, and Wolfram Burgard. 2015. Robot, organize my shelves! Tidying up objects by predicting user preferences. In 2015 IEEE international conference on robotics and automation (ICRA). IEEE, 1557–1564.

[2] Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6 (2018), 52138–52160.

[3] Henny Admoni and Brian Scassellati. 2017. Social eye gaze in human-robot interaction: a review. ACM Transactions on Human-Robot Interaction 6, 1 (2017), 25–63.

[4] Siddharth Agrawal and Mary-Anne Williams. 2017. Robot authority and human obedience: A study of human behaviour using a robot security guard. In Proceedings of the companion of the 2017 ACM/IEEE international conference on human-robot interaction. 57–58.

[5] Dan Amir and Ofra Amir. 2018. Highlights: Summarizing agent behavior to people. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 1168–1176.

[6] Ofra Amir, Finale Doshi-Velez, and David Sarne. 2018. Agent strategy summarization. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 1203–1207.

[7] Alexander Mois Aroyo, Francesco Rea, Giulio Sandini, and Alessandra Sciutti. 2018. Trust and social engineering in human robot interaction: Will a robot make you disclose sensitive information, conform to its recommendations or gamble? IEEE Robotics and Automation Letters 3, 4 (2018), 3701–3708.

[8] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58 (2020), 82–115.

[9] Shelly Bagchi, Jason R Wilson, Muneeb I Ahmad, Christian Dondrup, Zhao Han, Justin W Hart, Matteo Leonetti, Katrin Lohan, Ross Mead, Emmanuel Senft, et al. 2020. Proceedings of the AI-HRI Symposium at AAAI-FSS 2020. arXiv preprint arXiv:2010.13830 (2020).

[10] Albert Bandura. 1999. Social cognitive theory: An agentic perspective. Asian journal of social psychology 2, 1 (1999), 21–41.

[11] Albert Bandura. 2008. Observational Learning. American Cancer Society.

[12] Rachel Barr and Nancy Wyss. 2008. Reenactment of televised content by 2-year olds: Toddlers use language learned from television to solve a difficult imitation problem. Infant Behavior and Development 31, 4 (2008), 696–703.

[13] Gisela Böhm and Hans-Rüdiger Pfister. 2015. How people explain their own and others’ behavior: a theory of lay causal explanations. Frontiers in Psychology 6 (2015), 139.

[14] Jonathan Bohren, Radu Bogdan Rusu, E Gil Jones, Eitan Marder-Eppstein, Caroline Pantofaru, Melonee Wise, Lorenz Mösenlechner, Wim Meeussen, and Stefan Holzer. 2011. Towards autonomous robotic butlers: Lessons learned with the PR2. In 2011 IEEE International Conference on Robotics and Automation. IEEE, 5568–5575.

[15] Daniel J Brooks, Momotaz Begum, and Holly A Yanco. 2016. Analysis of reactions towards failures and recovery strategies for autonomous robots. In 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 487–492.

[16] Daphna Buchsbaum, Alison Gopnik, Thomas L Griffiths, and Patrick Shafto. 2011. Children’s imitation of causal action sequences is influenced by statistical and pedagogical evidence. Cognition 120, 3 (2011), 331–340.

[17] David Buttelmann, Andy Schieler, Nicole Wetzel, and Andreas Widmann. 2017. Infants’ and adults’ looking behavior does not indicate perceptual distraction for constrained modelled actions- An eye-tracking study. Infant Behavior and Development 47 (2017), 103–111.

[18] Elizabeth Cha, Yunkyung Kim, Terrence Fong, Maja J Mataric, et al. 2018. A survey of nonverbal signaling methods for non-humanoid robots. Foundations and Trends® in Robotics 6, 4 (2018), 211–323. PDF is available at https://www.lizcha.com/publications/ft_2018.pdf.

[19] Ravi Teja Chadalavada, Henrik Andreasson, Robert Krug, and Achim J Lilienthal. 2015. That’s on my mind! robot to human intention communication through on-board projection on shared floor space. In 2015 European Conference on Mobile Robots (ECMR). 1–6.

[20] Tathagata Chakraborti, Sarath Sreedharan, Yu Zhang, and Subbarao Kambhampati. 2017. Plan explanations as model reconciliation: moving beyond explanation as soliloquy. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 156–163.

[21] Tiffany L Chen, Matei Ciocarlie, Steve Cousins, Phillip M Grice, Kelsey Hawkins, Kaijen Hsiao, Charles C Kemp, Chih-Hung King, Daniel A Lazewatsky, Adam E Leeper, et al. 2013. Robots for humanity: using assistive robotics to empower people with disabilities. IEEE Robotics & Automation Magazine 20, 1 (2013), 30–39.

[22] Michele Colledanchise and Petter Ögren. 2018. Behavior Trees in Robotics and AI: An Introduction (1st ed.). CRC Press, Boca Raton, FL, USA.

[23] Michael D Coovert, Tiffany Lee, Ivan Shindev, and Yu Sun. 2014. Spatial augmented reality as a method for a mobile robot to communicate intended movement. Computers in Human Behavior 34 (2014), 241–248.

[24] Devleena Das, Siddhartha Banerjee, and Sonia Chernova. 2021. Explainable AI for robot failures: Generating explanations that improve user assistance in fault recovery. In Proceedings of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. 351–360.

[25] Maartje MA De Graaf and Bertram F Malle. 2017. How people explain action (and autonomous intelligent systems should too). In 2017 AAAI Fall Symposium Series.

[26] Maartje MA De Graaf and Bertram F Malle. 2019. People’s explanations of robot behavior subtly reveal mental state inferences. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 239–248.

[27] Maartje MA de Graaf, Bertram F Malle, Anca Dragan, and Tom Ziemke. 2018. Explainable robotic systems. In Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. 387–388.

[28] Maartje M. A. De Graaf, Anca Dragan, Bertram F. Malle, and Tom Ziemke. 2021. Introduction to the Special Issue on Explainable Robotic Systems. ACM Transactions on Human-Robot Interaction 10, 3, Article 22 (2021), 4 pages.

[29] Munjal Desai, Poornima Kaniarasu, Mikhail Medvedev, Aaron Steinfeld, and Holly Yanco. 2013. Impact of robot failures and feedback on real-time trust. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 251–258.

[30] Munjal Desai, Mikhail Medvedev, Marynel Vázquez, Sean McSheehy, Sofia Gadea-Omelchenko, Christian Bruggeman, Aaron Steinfeld, and Holly Yanco. 2012. Effects of changing reliability on trust of robot systems. In 2012 7th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 73–80.

[31] Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. 2013. Legibility and predictability of robot motion. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 301–308.

[32] Sarah Elliott, Zhe Xu, and Maya Cakmak. 2017. Learning generalizable surface cleaning actions from demonstration. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 993–999.

[33] Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods 39, 2 (2007), 175–191. Software available at http://www.gpower.hhu.de/.

[34] Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet, et al. 2019. MuMMER: Socially intelligent human-robot interaction in public spaces. arXiv preprint arXiv:1909.06749 (2019).

[35] Amy K Gardiner, Marissa L Greif, and David F Bjorklund. 2011. Guided by intention: Preschoolers’ imitation reflects inferences of causation. Journal of Cognition and Development 12, 3 (2011), 355–373.

[36] Alvika Gautam, Jacob W Crandall, and Michael A Goodrich. 2020. Self-assessment of Proficiency of Intelligent Systems: Challenges and Opportunities. In International Conference on Applied Human Factors and Ergonomics. Springer, 108–113.

[37] Abbas Pourhossein Gilakjani et al. 2012. Visual, auditory, kinaesthetic learning styles and their impacts on English language teaching. Journal of Studies in Education 2, 1 (2012), 104–113.

[38] David Gunning, Mark Stefik, Jaesik Choi, Timothy Miller, Simone Stumpf, and Guang-Zhong Yang. 2019. XAI—Explainable artificial intelligence. Science Robotics 4, 37 (2019).

[39] Zhao Han, Jordan Allspaw, Gregory LeMasurier, Jenna Parrillo, Daniel Giger, S Reza Ahmadzadeh, and Holly A Yanco. 2020. Towards Mobile Multi-Task Manipulation in a Confined and Integrated Environment with Irregular Objects. In 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 11025–11031.

[40] Zhao Han, Daniel Giger, Jordan Allspaw, Michael S Lee, Henny Admoni, and Holly A Yanco. 2021. Building The Foundation of Robot Explanation Generation Using Behavior Trees. ACM Transactions on Human-Robot Interaction 10, 3 (2021), 31 pages.

[41] Zhao Han, Jenna Parrillo, Alexander Wilkinson, Holly A Yanco, and Tom Williams. 2022. Projecting Robot Navigation Paths: Hardware and Software for Projected AR. In 2022 ACM/IEEE International Conference on Human-Robot Interaction (HRI), Short Contributions.

[42] Zhao Han, Elizabeth Phillips, and Holly A Yanco. 2021. The Need for Verbal Robot Explanations and How People Would Like a Robot To Explain Itself. ACM Transactions on Human-Robot Interaction 10, 4 (2021).

[43] Zhao Han, Alexander Wilkinson, Jenna Parrillo, Jordan Allspaw, and Holly A Yanco. 2020. Projection Mapping Implementation: Enabling Direct Externalization of Perception Results and Action Intent to Improve Robot Explainability. The AAAI Fall Symposium on The Artificial Intelligence for Human-Robot Interaction 2020 (AI-HRI) (2020).

[44] Zhao Han, Tom Williams, and Holly A Yanco. 2022. Mixed-Reality Robot Behavior Replay: A System Implementation. In 2022 AAAI Fall Symposium on The Artificial Intelligence for Human-Robot Interaction (AI-HRI).

[45] Zhao Han and Holly Yanco. 2019. The effects of proactive release behaviors during human-robot handovers. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 440–448.

[46] Sandra G Hart. 2006. NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 50. Sage Publications, Los Angeles, CA, 904–908.

[47] Kunimatsu Hashimoto, Fuminori Saito, Takashi Yamamoto, and Koichi Ikeda. 2013. A field study of the human support robot in the home environment. In 2013 IEEE Workshop on Advanced Robotics and its Social Impacts. IEEE, 143–150.

[48] Bradley Hayes and Julie A Shah. 2017. Improving robot controller transparency through autonomous policy explanation. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 303–312.

[49] Stefanie Hoehl, Stefanie Keupp, Hanna Schleihauf, Nicola McGuigan, David Buttelmann, and Andrew Whiten. 2019. ‘Over-imitation’: A review and appraisal of a decade of research. Developmental Review 51 (2019), 90–108.

[50] Hiroshi Ishiguro, Tetsuo Ono, Michita Imai, Takeshi Maeda, Takayuki Kanda, and Ryohei Nakatsu. 2001. Robovie: an interactive humanoid robot. Industrial Robot: An International Journal 28, 6 (2001), 498–504.

[51] Gayane Kazhoyan, Simon Stelter, Franklin Kenghagho Kenfack, Sebastian Koralewski, and Michael Beetz. 2021. The robot household marathon experiment. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 9382–9388.

[52] Charles C Kemp, Aaron Edsinger, Henry M Clever, and Blaine Matulevich. 2021. The Design of Stretch: A Compact, Lightweight Mobile Manipulator for Indoor Human Environments. arXiv preprint arXiv:2109.10892 (2021).

[53] Ben Kenward. 2012. Over-imitating preschoolers believe unnecessary actions are normative and enforce their performance by a third party. Journal of Experimental Child Psychology 112, 2 (2012), 195–207.

[54] Annette M Klein, Petra Hauf, and Gisa Aschersleben. 2006. The role of action effects in 12-month-olds’ action control: A comparison of televised model and live model. Infant Behavior and Development 29, 4 (2006), 535–544.

[55] Lars Kunze, Michael Beetz, Manabu Saito, Haseru Azuma, Kei Okada, and Masayuki Inaba. 2012. Searching objects in large-scale indoor environments: A decision-theoretic approach. In 2012 IEEE International Conference on Robotics and Automation. IEEE, 4385–4390.

[56] Minae Kwon, Sandy H Huang, and Anca D Dragan. 2018. Expressing robot incapability. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. 87–95.

[57] Tania Lombrozo. 2006. The structure and function of explanations. Trends in Cognitive Sciences 10, 10 (2006), 464–470.

[58] Derek E Lyons, Andrew G Young, and Frank C Keil. 2007. The hidden structure of overimitation. Proceedings of the National Academy of Sciences 104, 50 (2007), 19751–19756.

[59] Bertram F Malle. 2006. How the mind explains behavior: Folk explanations, meaning, and social interaction. MIT Press.

[60] Akihiro Matsufuji and Angelica Lim. 2021. Perceptual Effects of Ambient Sound on an Artificial Agent’s Rate of Speech. In Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. 67–70.

[61] Stephen Miller, Jur Van Den Berg, Mario Fritz, Trevor Darrell, Ken Goldberg, and Pieter Abbeel. 2012. A geometric approach to robotic laundry folding. The International Journal of Robotics Research 31, 2 (2012), 249–267.

[62] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38.

[63] AJung Moon, Daniel M Troniak, Brian Gleeson, Matthew KXJ Pan, Minhua Zheng, Benjamin A Blumer, Karon MacLean, and Elizabeth A Croft. 2014. Meet me where I’m gazing: how shared attention gaze affects human-robot handover timing. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction. 334–341.

[64] Bonnie M Muir. 1987. Trust between humans and machines, and the design of decision aids. International Journal of Man-Machine Studies 27, 5-6 (1987), 527–539.

[65] Donna L Mumme and Anne Fernald. 2003. The infant as onlooker: Learning from emotional reactions observed in a television scenario. Child Development 74, 1 (2003), 221–237.

[66] Hirenkumar Nakawala, Paulo JS Goncalves, Paolo Fiorini, Giancarlo Ferrigno, and Elena De Momi. 2018. Approaches for action sequence representation in robotics: a review. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5666–5671.

[67] Allen Newell et al. 1982. The knowledge level. Artificial Intelligence 18, 1 (1982), 87–127.

[68] Hai Nguyen, Cressel Anderson, Alexander Trevor, Advait Jain, Zhe Xu, and Charles C Kemp. 2008. El-E: An assistive robot that fetches objects from flat surfaces. In Robotic Helpers Workshop, International Conference on Human-Robot Interaction.

[69] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).

[70] Stefan Palan and Christian Schitter. 2018. Prolific.ac – A subject pool for online experiments. Journal of Behavioral and Experimental Finance 17 (2018), 22–27. Available at https://prolific.co/.

[71] Forough Poursabzi-Sangdeh, Daniel G Goldstein, Jake M Hofman, Jennifer Wortman Vaughan, and Hanna Wallach. 2021. Manipulating and measuring model interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–52.

[72] Michael A. Rosen, Eduardo Salas, Davin Pavlas, Randy Jensen, Dan Fu, and Donald Lampton. 2010. Demonstration-Based Training: A Review of Instructional Features. Human Factors 52, 5 (2010), 596–609.

[73] Emrah Akin Sisbot, Rachid Alami, Thierry Siméon, Kerstin Dautenhahn, Michael Walters, and Sarah Woods. 2005. Navigation in the presence of humans. In 5th IEEE-RAS International Conference on Humanoid Robots, 2005. IEEE, 181–188.

[74] Kristyn Sommer, Rebecca Davidson, Kristy L Armitage, Virginia Slaughter, Janet Wiles, and Mark Nielsen. 2020. Preschool children overimitate robots, but do so less than they overimitate humans. Journal of Experimental Child Psychology 191 (2020), 104702.

[75] Siddhartha S Srinivasa, Dave Ferguson, Casey J Helfrich, Dmitry Berenson, Alvaro Collet, Rosen Diankov, Garratt Gallagher, Geoffrey Hollinger, James Kuffner, and Michael Vande Weghe. 2010. HERB: a home exploring robotic butler. Autonomous Robots 28, 1 (2010), 5–20.

[76] Sonja Stange and Stefan Kopp. 2020. Effects of a Social Robot’s Self-Explanations on How Humans Understand and Evaluate Its Behavior. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. 619–627.

[77] Aaron Steinfeld and Michael Goodrich. 2020. Assessing, Explaining, and Conveying Robot Proficiency for Human-Robot Teaming. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. 662–662.

[78] Paul J Taylor, Darlene F Russ-Eft, and Daniel WL Chan. 2005. A meta-analytic review of behavior modeling training. Journal of Applied Psychology 90, 4 (2005), 692.

[79] Sam Thellman, Annika Silvervarg, and Tom Ziemke. 2017. Folk-psychological interpretation of human vs. humanoid robot behavior: Exploring the intentional stance toward robots. Frontiers in Psychology 8 (2017).

[80] Ning Wang, David V Pynadath, and Susan G Hill. 2016. Trust calibration within a human-robot team: Comparing automatically generated explanations. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 109–116.

[81] Auriel Washburn, Akanimoh Adeleye, Thomas An, and Laurel D Riek. 2020. Robot errors in proximate HRI: how functionality framing affects perceived reliability and trust. ACM Transactions on Human-Robot Interaction (THRI) 9, 3 (2020), 1–21.

[82] Eric M Wetzel, Muhammad Umer, Will Richardson, and Justin Patton. 2022. A Step Towards Automated Tool Tracking on Construction Sites: Boston Dynamics SPOT and RFID. EPiC Series in Built Environment 3 (2022), 488–496.

[83] Andrew Whiten, Gillian Allan, Siobahn Devlin, Natalie Kseib, Nicola Raw, and Nicola McGuigan. 2016. Social learning in the real-world: ‘Over-imitation’ occurs in both children and adults unaware of participation in an experiment and independently of social interaction. PLoS ONE 11, 7 (2016), e0159920.

[84] Christopher D Wickens, Justin G Hollands, Simon Banbury, and Raja Parasuraman. 2015. Engineering psychology and human performance. Psychology Press.

[85] Alexander Wilkinson, Michael Gonzales, Patrick Hoey, David Kontak, Dian Wang, Noah Torname, Sam Laderoute, Zhao Han, Jordan Allspaw, Robert Platt, et al. 2021. Design guidelines for human–robot interaction with assistive robot manipulation systems. Paladyn, Journal of Behavioral Robotics 12, 1 (2021), 392–401.

[86] Melonee Wise, Michael Ferguson, Derek King, Eric Diehr, and David Dymesich. 2016. Fetch and freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots.

[87] Huaxia Xia and Haiming Yang. 2018. Is last-mile delivery a ‘killer app’ for self-driving vehicles? Communications of the ACM 61, 11 (2018), 70–75.

[88] Lixiao Zhu and Thomas Williams. 2020. Effects of Proactive Explanations by Robots on Human-Robot Trust. In International Conference on Social Robotics. Springer, 85–95.

[89] Norbert Zmyj, Moritz M Daum, and Gisa Aschersleben. 2009. The development of rational imitation in 9- and 12-month-old infants. Infancy 14, 1 (2009), 131–141.