Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents

ICCV 2023

1Yonsei University, 2Gwangju Institute of Science and Technology

🏆 Challenge Winners 🏆

1st Generalist Language Grounding Agents Challenge (CVPRW'23)


Figure: Overview of CAPEAM, comprising Context-Aware Planning (CAP) and Environment-Aware Memory (EAM).

Abstract

Accomplishing household tasks requires planning step-by-step actions while considering the consequences of previous actions. However, state-of-the-art embodied agents often make mistakes in navigating the environment and interacting with proper objects because they learn by imitating experts or algorithmic planners that lack such knowledge. To improve both visual navigation and object interaction, we propose CAPEAM (Context-Aware Planning and Environment-Aware Memory), which considers the consequences of taken actions by incorporating semantic context (e.g., appropriate objects to interact with) into a sequence of actions, and the changed spatial arrangement and states of interacted objects (e.g., the location an object has been moved to) into the inference of subsequent actions. We empirically show that an agent with the proposed CAPEAM achieves state-of-the-art performance by large margins in various metrics on a challenging interactive instruction following benchmark, in both seen and unseen environments (up to +10.70% in unseen env.).

Context-Aware Planning and Environment-Aware Memory

State-of-the-art embodied agents often make mistakes in navigating the environment and interacting with proper objects because they learn by imitating experts or algorithmic planners that lack such knowledge. To address this issue, we propose CAPEAM, which incorporates the contextual information of previous actions into planning and maintains the spatial arrangement of objects along with their states (e.g., whether an object has been moved) for the perception model, improving both visual navigation and object interaction.

Context-Aware Planning

To plan a sequence of sub-goals conditioned on task-relevant objects, we first define `context' as the set of task-relevant objects shared across the sub-goals of a given task. The proposed `context-aware planning' (CAP) divides planning into two phases: 1) a sub-goal planner, which generates sub-goals, and 2) a detailed planner, which generates, for each sub-goal, a sequence of detailed actions and the objects to interact with.

The sub-goal planner further comprises two sub-modules: a context predictor, which predicts three task-relevant objects, and a sub-goal frame sequence generator, which generates a sequence of object-agnostic sub-goals, referred to as sub-goal frames. We integrate the predicted task-relevant objects into the sequence of sub-goal frames to produce the final sub-goal sequence.
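
To make the two-phase structure concrete, here is a minimal sketch in Python. The function names, the placeholder slots (e.g., <target>, <mrecep>, <recep>), and the toy outputs are our own illustrative assumptions, not the paper's implementation:

# A minimal sketch of context-aware planning (CAP). The names,
# placeholder slots, and toy outputs below are illustrative
# assumptions, not the paper's actual implementation.
from typing import Dict, List, Tuple

Subgoal = Tuple[str, str]  # (sub-goal action, object argument)

def predict_context(instruction: str) -> Dict[str, str]:
    # Stand-in for the learned context predictor: maps the task
    # description to three task-relevant objects.
    return {"<target>": "Apple", "<mrecep>": "Fridge", "<recep>": "DiningTable"}

def generate_subgoal_frames(instruction: str) -> List[Subgoal]:
    # Stand-in for the sub-goal frame sequence generator: produces
    # object-agnostic sub-goal frames with placeholder slots.
    return [("PickupObject", "<target>"),
            ("CoolObject", "<mrecep>"),
            ("PutObject", "<recep>")]

def plan_subgoals(instruction: str) -> List[Subgoal]:
    # Fill each object-agnostic frame with the predicted context
    # objects to obtain the final sub-goal sequence.
    context = predict_context(instruction)
    return [(action, context.get(slot, slot))
            for action, slot in generate_subgoal_frames(instruction)]

print(plan_subgoals("Put a chilled apple on the dining table."))
# -> [('PickupObject', 'Apple'), ('CoolObject', 'Fridge'), ('PutObject', 'DiningTable')]

Each resulting sub-goal would then be handed to the detailed planner, which expands it into low-level navigation and interaction actions.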

Environment-Aware Memory

To enable the agent to keep track of object configurations, we propose to maintain a memory of past environmental information and use it to predict proper action sequences during a task.

Retrospective Object Recognition. To allow the agent to keep interacting with the same object even when its visual appearance changes over multiple interactions, we retain the latest segmentation mask of each object and reuse it as the current object's mask whenever the agent interacts with the same object but fails to recognize it.
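
A minimal sketch of this fallback, assuming a simple per-object mask cache (class and method names are hypothetical):

# Hypothetical sketch of retrospective object recognition: if the
# detector misses the object the agent is interacting with, fall
# back to its most recently observed mask.
from typing import Dict, Optional
import numpy as np

class MaskMemory:
    def __init__(self) -> None:
        self.latest_masks: Dict[str, np.ndarray] = {}

    def resolve(self, obj: str, detected: Optional[np.ndarray]) -> Optional[np.ndarray]:
        # Prefer the current detection and cache it; otherwise reuse
        # the latest cached mask so interaction with the same object
        # continues even after its appearance changes (e.g., slicing).
        if detected is not None:
            self.latest_masks[obj] = detected
            return detected
        return self.latest_masks.get(obj)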

Object Relocation Tracking. To prevent the agent from redundantly interacting with already relocated objects, we maintain the most recent location of each relocated object and exclude such objects from the semantic map as future navigation targets.
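
A sketch of how relocated objects could be filtered out of the navigation targets (all names are hypothetical):

# Hypothetical sketch of object relocation tracking: remember where
# each object was moved and drop relocated objects from the pool of
# future navigation targets.
from typing import Dict, List, Tuple

Location = Tuple[int, int]  # a cell on the 2D semantic map

class RelocationTracker:
    def __init__(self) -> None:
        self.relocated: Dict[str, Location] = {}

    def record_move(self, obj: str, new_location: Location) -> None:
        # Note the most recent location of a relocated object.
        self.relocated[obj] = new_location

    def filter_targets(self, candidates: List[str]) -> List[str]:
        # Exclude already-relocated objects so the agent does not
        # redundantly revisit or re-move them.
        return [obj for obj in candidates if obj not in self.relocated]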

Object Location Caching. To reduce the need for the agent to re-explore an environment, we cache in memory the locations and masks of objects whose states change, so that the agent can navigate back to the remembered locations and interact with the objects using the cached masks.
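
A sketch of such a cache, assuming the agent stores a map location and the last mask per state-changed object (all names hypothetical):

# Hypothetical sketch of object location caching: remember the map
# location and mask of objects whose state changed, so the agent can
# navigate back and interact without re-exploring.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple
import numpy as np

Location = Tuple[int, int]

@dataclass
class CachedObject:
    location: Location   # where the object was last observed
    mask: np.ndarray     # its last segmentation mask

class LocationCache:
    def __init__(self) -> None:
        self._cache: Dict[str, CachedObject] = {}

    def remember(self, obj: str, location: Location, mask: np.ndarray) -> None:
        # Cache an object whose state has changed (e.g., sliced, heated).
        self._cache[obj] = CachedObject(location, mask)

    def recall(self, obj: str) -> Optional[CachedObject]:
        # Returning the cached entry lets the agent navigate back to
        # the remembered location and reuse the stored mask.
        return self._cache.get(obj)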

Results

We employ ALFRED to evaluate our method. ALFRED provides three splits of environments: ‘train’, ‘validation’, and ‘test’. The validation and test environments are further divided into two folds, seen and unseen, to assess generalization. The primary metric is the success rate, denoted by ‘SR’, which measures the percentage of completed tasks. Another metric is the goal-condition success rate, denoted by ‘GC’, which measures the percentage of satisfied goal conditions. Finally, path-length-weighted (PLW) scores penalize SR and GC by the length of the action sequence the agent takes.
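
For reference, the path-length weighting in ALFRED scales a score by the ratio of the expert demonstration's path length to the agent's, so longer-than-necessary trajectories are penalized (a minimal sketch; the function name is ours):

def plw(score: float, expert_len: int, agent_len: int) -> float:
    # Path-length-weighted (PLW) score: scale the score s by
    # L* / max(L*, L_hat), where L* is the expert demonstration's
    # path length and L_hat is the agent's actual path length.
    return score * expert_len / max(expert_len, agent_len)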

When the hand-designed action-sequence templates are combined with our agent (✓ in ‘Tem. Act.’), it outperforms all prior art in novel environments in terms of success rate, the benchmark's main metric. Without the templated action sequences (✗ in ‘Tem. Act.’), our method outperforms all prior art by large margins in SR and GC in both seen and unseen environments. As we consistently observe improvements both with and without the low-level instructions, our method does not appear to rely heavily on detailed task descriptions.

For more details, please check out the paper.


Figures: comparison with the state of the art, and ablations of Context-Aware Planning, Retrospective Object Recognition, Object Relocation Tracking, and Object Location Caching.

BibTeX

@inproceedings{kim2023context,
  author    = {Kim, Byeonghwi and Kim, Jinyeon and Kim, Yuyeong and Min, Cheolhong and Choi, Jonghyun},
  title     = {Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents},
  booktitle = {ICCV},
  year      = {2023},
}