We address the interactive instruction following task, which requires an agent to navigate through an environment, interact with objects, and complete long-horizon tasks by following natural language instructions with egocentric vision. To achieve a goal in this task, the agent must infer a sequence of actions and object interactions.
When performing these actions, a small field of view often limits the agent's understanding of the environment, leading to poor performance. Here, we propose to exploit surrounding views, i.e., additional observations from navigable directions, to enlarge the agent's field of view. Specifically, at each time step we gather observations from four navigable directions (left, right, up, and down) in addition to the egocentric view, as sketched below.
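As a rough illustration, surrounding views could be collected in the underlying AI2-THOR simulator by temporarily rotating or tilting the agent and capturing a frame for each direction before restoring the pose. The snippet below is a minimal, hypothetical sketch assuming the ai2thor `Controller` API (`RotateLeft`/`RotateRight`/`LookUp`/`LookDown` and `event.frame`); it is not the exact data-collection code used in our implementation.

```python
# Minimal sketch: capture the egocentric view plus 4 surrounding views at one time step.
# Assumes ai2thor's Controller API; action names and defaults may differ by version.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # hypothetical scene choice

def capture_surrounding_views(controller):
    views = {}
    # Egocentric view at the current pose.
    views["ego"] = controller.last_event.frame
    # For each direction: move the camera, grab the frame, then undo the move.
    for name, go, back in [
        ("left", "RotateLeft", "RotateRight"),
        ("right", "RotateRight", "RotateLeft"),
        ("up", "LookUp", "LookDown"),
        ("down", "LookDown", "LookUp"),
    ]:
        event = controller.step(action=go)
        views[name] = event.frame
        controller.step(action=back)
    return views  # dict of RGB numpy arrays, one per view
```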
Our model is largely based on MOCA, which factorizes perception and policy and employs language-guided dynamic filters, object-centric localization with instance association, and obstruction evasion. For more details of these building components, please refer to MOCA.
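To give a flavor of one of these components, the snippet below sketches the general idea behind language-guided dynamic filters: convolutional filters are predicted from a language embedding and then applied to the visual features. This is an illustrative, hypothetical implementation (the module name, dimensions, and grouped-convolution trick are our assumptions), not MOCA's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedFilter(nn.Module):
    """Predict 1x1 conv filters from a language embedding and apply them to visual features."""

    def __init__(self, lang_dim=512, vis_channels=512, num_filters=16):
        super().__init__()
        self.num_filters = num_filters
        self.vis_channels = vis_channels
        self.filter_gen = nn.Linear(lang_dim, num_filters * vis_channels)

    def forward(self, vis_feat, lang_emb):
        # vis_feat: (B, C, H, W) visual features, lang_emb: (B, lang_dim) instruction embedding
        B, C, H, W = vis_feat.shape
        filters = self.filter_gen(lang_emb).view(B * self.num_filters, C, 1, 1)
        # Apply a different set of filters per sample via a grouped convolution.
        out = F.conv2d(vis_feat.view(1, B * C, H, W), filters, groups=B)
        return out.view(B, self.num_filters, H, W)
```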
We employ ALFRED to evaluate our method. ALFRED provides three splits of environments: "train", "validation", and "test". The validation and test environments are further divided into two folds, seen and unseen, to assess generalization. The primary metric is the success rate ("SR"), which measures the percentage of completed tasks. Another metric is the goal-condition success rate ("GC"), which measures the percentage of satisfied goal conditions. Finally, path-length-weighted (PLW) scores penalize SR and GC by the length of the action sequence the agent takes.
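For reference, the path-length-weighted score is typically computed by scaling the raw score with the ratio of the expert demonstration length to the agent's path length (following the standard ALFRED definition). The helper below is a small illustrative sketch, not the official evaluation code.

```python
def path_length_weighted(score, expert_path_len, agent_path_len):
    """Scale a score (SR or GC) by the agent's path length.

    score:           1.0/0.0 for SR, or the fraction of satisfied conditions for GC.
    expert_path_len: number of actions in the expert demonstration.
    agent_path_len:  number of actions the agent actually took.
    """
    return score * (expert_path_len / max(expert_path_len, agent_path_len))

# Example: the agent succeeds (1.0) but takes 80 actions where the expert needs 40.
plw_sr = path_length_weighted(1.0, 40, 80)  # 0.5
```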
The proposed method with surrounding perception outperforms the previous challenge winner [7] on all "Task" and "Goal-Cond" metrics in seen and unseen environments for both validation and test splits by large absolute margins.
For more details, please check out the paper.
@inproceedings{kim2021agent,
author = {Kim, Byeonghwi and Bhambri, Suvaansh and Singh, Kunal Pratap and Mottaghi, Roozbeh and Choi, Jonghyun},
title = {Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following},
booktitle = {Embodied AI Workshop @ CVPR 2021},
year = {2021},
}