We address the interactive instruction following task, which requires an agent to navigate through an environment, interact with objects, and complete long-horizon tasks by following natural language instructions with egocentric vision. To achieve a goal in this task, the agent must infer a sequence of actions and object interactions.
When performing these actions, a small field of view often limits the agent's understanding of the environment, leading to poor performance. Here, we propose to exploit surrounding views, i.e., additional observations from navigable directions, to enlarge the agent's field of view. Specifically, at each time step we gather observations from four navigable directions (left, right, up, and down) in addition to the egocentric view, as sketched below.
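As a rough illustration, surrounding views could be collected in the underlying AI2-THOR simulator by temporarily rotating or tilting the agent and capturing a frame for each direction before restoring the pose. The snippet below is a minimal, hypothetical sketch assuming the ai2thor `Controller` API (`RotateLeft`/`RotateRight`/`LookUp`/`LookDown` and `event.frame`); it is not the exact data-collection code used in our implementation.

```python
# Minimal sketch: capture the egocentric view plus 4 surrounding views at one time step.
# Assumes ai2thor's Controller API; action names and defaults may differ by version.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # hypothetical scene choice

def capture_surrounding_views(controller):
    views = {}
    # Egocentric view at the current pose.
    views["ego"] = controller.last_event.frame
    # For each direction: move the camera, grab the frame, then undo the move.
    for name, go, back in [
        ("left", "RotateLeft", "RotateRight"),
        ("right", "RotateRight", "RotateLeft"),
        ("up", "LookUp", "LookDown"),
        ("down", "LookDown", "LookUp"),
    ]:
        event = controller.step(action=go)
        views[name] = event.frame
        controller.step(action=back)
    return views  # dict of RGB numpy arrays, one per view
```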
Our model is largely based on MOCA, which factorizes perception and policy and employs language-guided dynamic filters, object-centric localization with instance association, and obstruction evasion. For more details of these building components, please refer to MOCA.
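To give a flavor of one of these components, the snippet below sketches the general idea behind language-guided dynamic filters: convolutional filters are predicted from a language embedding and then applied to the visual features. This is an illustrative, hypothetical implementation (the module name, dimensions, and grouped-convolution trick are our assumptions), not MOCA's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedFilter(nn.Module):
    """Predict 1x1 conv filters from a language embedding and apply them to visual features."""

    def __init__(self, lang_dim=512, vis_channels=512, num_filters=16):
        super().__init__()
        self.num_filters = num_filters
        self.vis_channels = vis_channels
        self.filter_gen = nn.Linear(lang_dim, num_filters * vis_channels)

    def forward(self, vis_feat, lang_emb):
        # vis_feat: (B, C, H, W) visual features, lang_emb: (B, lang_dim) instruction embedding
        B, C, H, W = vis_feat.shape
        filters = self.filter_gen(lang_emb).view(B * self.num_filters, C, 1, 1)
        # Apply a different set of filters per sample via a grouped convolution.
        out = F.conv2d(vis_feat.view(1, B * C, H, W), filters, groups=B)
        return out.view(B, self.num_filters, H, W)
```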
We employ ALFRED to evaluate our method. ALFRED provides three splits of environments: "train", "validation", and "test". The validation and test environments are further divided into two folds, seen and unseen, to assess generalization. The primary metric is the success rate ("SR"), which measures the percentage of completed tasks. Another metric is the goal-condition success rate ("GC"), which measures the percentage of satisfied goal conditions. Finally, path-length-weighted (PLW) scores penalize SR and GC by the length of the action sequence the agent takes.
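For reference, the path-length-weighted score is typically computed by scaling the raw score with the ratio of the expert demonstration length to the agent's path length (following the standard ALFRED definition). The helper below is a small illustrative sketch, not the official evaluation code.

```python
def path_length_weighted(score, expert_path_len, agent_path_len):
    """Scale a score (SR or GC) by the agent's path length.

    score:           1.0/0.0 for SR, or the fraction of satisfied conditions for GC.
    expert_path_len: number of actions in the expert demonstration.
    agent_path_len:  number of actions the agent actually took.
    """
    return score * (expert_path_len / max(expert_path_len, agent_path_len))

# Example: the agent succeeds (1.0) but takes 80 actions where the expert needs 40.
plw_sr = path_length_weighted(1.0, 40, 80)  # 0.5
```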
The proposed method with surrounding perception outperforms the previous challenge winner [7] on all "Task" and "Goal-Cond" metrics in seen and unseen environments for both validation and test splits by large absolute margins.
For more details, please check out the paper.
@inproceedings{kim2021agent,
author = {Kim, Byeonghwi and Bhambri, Suvaansh and Singh, Kunal Pratap and Mottaghi, Roozbeh and Choi, Jonghyun},
title = {Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following},
booktitle = {Embodied AI Workshop @ CVPR 2021},
year = {2021},
}