Instruction-driven history-aware policies for robotic manipulations

Pierre-Louis Guhur¹, Shizhe Chen¹, Ricardo Garcia¹, Makarand Tapaswi², Ivan Laptev¹, Cordelia Schmid¹

¹Inria, École normale supérieure, CNRS, PSL Research University
²IIIT Hyderabad

A single Hiveformer model can perform 74 tasks from RLBench given language instructions. A task can have multiple variations, as in the push buttons task, and we test our model on unseen variations of such tasks.

Hiveformer jointly models instructions, views from multiple cameras, and historical actions and observations with a multimodal transformer for robotic manipulation.
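
To make the joint modeling concrete, the following minimal sketch shows one way the three input streams could be laid out as a single token sequence before being passed to a transformer. It only illustrates the bookkeeping (which tokens exist and how they are tagged by modality, time step, and camera); it is not the authors' implementation. The `Token` class, the `build_sequence` function, the camera names, and the number of patches per view are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical tag describing one input token (no features, only metadata)."""
    modality: str  # "instruction", "observation", or "action"
    step: int      # time step in the episode (-1 for instruction tokens)
    camera: str    # camera name for observation tokens, "" otherwise
    index: int     # word position or patch index within its group


def build_sequence(instruction: List[str], cameras: List[str],
                   patches_per_view: int, num_steps: int) -> List[Token]:
    """Lay out instruction words, per-camera patches, and past actions as one sequence.

    Assumption: the action at the latest step is the one to be predicted,
    so only actions from earlier steps appear in the history.
    """
    tokens = [Token("instruction", -1, "", i) for i in range(len(instruction))]
    for t in range(num_steps):
        for cam in cameras:
            tokens.extend(Token("observation", t, cam, p)
                          for p in range(patches_per_view))
        if t < num_steps - 1:
            tokens.append(Token("action", t, "", 0))
    return tokens


if __name__ == "__main__":
    seq = build_sequence(
        instruction=["push", "the", "red", "button"],
        cameras=["wrist", "left_shoulder", "right_shoulder"],  # assumed names
        patches_per_view=4,
        num_steps=3,
    )
    for tok in seq[:6]:
        print(tok)
```

In a real model each tag would be replaced by a learned embedding (word, patch, or action features plus modality, step, and camera encodings), and attention over the full sequence is what would let the current step condition on the instruction, all views, and the history.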