ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks

Philip Schroeder1     Ondrej Biza2     Thomas Weng2     Hongyin Luo1     James Glass1    

  • 1MIT
  • 2RAI Institute


Interactive Visualization of ROVER Reasoning

Select below to see ROVER versus GVL Zero-Shot progress prediction and frame-level reasoning on videos from successful and unsuccessful trajectories across 27 single-stage and multi-stage tasks. The videos are separated into levels based on the amount of the task completed during the video. The amount of non-expert behavior in the video increases as the level decreases. The highest-level trajectories for each task show full task completion with near-expert behavior. The level 1 trajectories do not achieve any part of the task. The task progress values generated by ROVER and GVL are shown in yellow and blue, respectively. The ground-truth progress values are shown in gray.





Abstract

Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new dataset derived from RoboCasa that consists of 543 videos showing both expert and perturbed non-expert trajectories across 27 robotic manipulation tasks. ROVER outperforms strong baselines across three video reasoning tasks: task progress estimation, frame-level natural language reasoning, and video question answering. We observe that, by reducing the number of frames the model reasons over at each timestep, ROVER mitigates hallucinations, especially during unexpected or non-optimal moments of a trajectory. In addition, by enabling the implementation of a subtask-specific sliding context window, ROVER's time complexity scales linearly with video length, an asymptotic improvement over baselines.

Method Overview

Reasoning Over VidEo Recursively (ROVER).

ROVER is a recursive framework for reasoning over camera video that decomposes a task into subtasks to maintain a compact temporal context, improving reasoning accuracy and efficiency. Instead of generating a single, long line of reasoning spanning all timesteps of the video input for a task attempt (e.g., opening a microwave door), ROVER decomposes the task and generates a separate line of reasoning for each subtask (e.g., grasping the microwave door handle). When a subtask is complete, the corresponding line of reasoning terminates and a new one is created for the next subtask (e.g., pulling the door open). We show that this decomposition not only improves accuracy by focusing the reasoning on relevant temporal segments, but also enables the implementation of a subtask-specific sliding context window, which further reduces the number of frames the model must reason over at each moment of a trajectory.
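A minimal sketch of this loop is shown below, assuming hypothetical helpers decompose_task, query_vlm_frame, and subtask_complete that wrap the underlying VLM calls; it illustrates how subtask-level reasoning keeps the per-step context small rather than reproducing the exact prompting used by ROVER.

from collections import deque

def rover(frames, task, window_size=8):
    # Decompose the task into subtasks, e.g. ["grasp handle", "pull door open"].
    subtasks = decompose_task(task)          # hypothetical VLM-backed helper
    current = 0                              # index of the active subtask
    window = deque(maxlen=window_size)       # subtask-specific sliding context window
    subtask_reasoning = []                   # line of reasoning for the active subtask
    completed = []                           # summaries of finished subtasks (global context)
    progress = []

    for frame in frames:
        window.append(frame)
        # Reason only over the recent frames of the active subtask, plus a
        # short summary of what has already been completed.
        step = query_vlm_frame(list(window), subtasks[current],
                               history=subtask_reasoning, done=completed)
        subtask_reasoning.append(step.description)
        progress.append(step.progress)

        # When the subtask is judged complete, terminate its line of reasoning
        # and start a fresh one (and a fresh window) for the next subtask.
        if subtask_complete(step) and current + 1 < len(subtasks):
            completed.append(subtasks[current])
            current += 1
            window.clear()
            subtask_reasoning = []

    return progress

Because each VLM call sees at most window_size frames, the per-timestep cost is bounded, and the total cost grows linearly with video length, which is the scaling behavior noted above.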

Experiments

We evaluate ROVER, implemented using an in-context learning approach, in the setting of robotic manipulation tasks using a large-scale dataset of videos collected from robot-mounted and third-person camera viewpoints during both successful and unsuccessful task attempts. We create this dataset by automatically perturbing expert demonstrations collected in RoboCasa to produce diverse trajectories, ranging from near-optimal to fully random action sequences. In addition, we compute ground-truth task progress estimates at each timestep based on geometric distance to goal states. The generated dataset includes 543 videos across 27 tasks, each collected in a random kitchen scene.
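As a rough illustration (the task-specific distance functions used to label the dataset are not detailed here), ground-truth progress can be computed by normalizing the remaining geometric distance to the goal state against the distance at the start of the trajectory:

import numpy as np

def ground_truth_progress(states, goal):
    # states: (T, D) task-relevant poses at each timestep; goal: (D,) goal configuration.
    # Progress in [0, 1] rises as the remaining distance to the goal shrinks.
    dists = np.linalg.norm(states - goal, axis=1)
    d0 = dists[0] if dists[0] > 0 else 1.0
    return 1.0 - np.clip(dists / d0, 0.0, 1.0)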

We leverage this dataset to evaluate ROVER across three benchmarks for embodied reasoning over camera video:

  1. Frame-level task progress estimation
  2. Frame-level natural language reasoning
  3. Video question answering (QA)


1) ROVER improves frame-level progress prediction

For videos that exhibit task completion with near-expert behavior (i.e., the highest level within each task group), ROVER, GVL, and TemporalConcat all achieve a Pearson correlation with ground-truth progress estimates near or above 0.5 across most task groups. However, for videos with incomplete task execution, the predictions from GVL and TemporalConcat deviate significantly from the ground truth. These deviations become more extreme as the trajectory level decreases (i.e., as the number of non-expert states in the video increases).
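For reference, this correlation can be computed per video with SciPy and then averaged within each task group; the function below assumes predicted and ground-truth progress arrays of equal length:

from scipy.stats import pearsonr

def progress_correlation(predicted, ground_truth):
    # Pearson correlation between a predicted and a ground-truth progress curve.
    r, _ = pearsonr(predicted, ground_truth)
    return r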


2) ROVER achieves lower error rate in frame-level reasoning

The reasoning error rate is similar across methods for videos containing near-expert task completion. However, as videos deviate from expert behavior, the error rate increases significantly for GVL and TemporalConcat. These trends mirror the results of the progress prediction task, suggesting that errors in progress prediction stem from errors in the natural language description of the frame that precedes each progress prediction. We examine the nature of these errors further in the video QA analysis below.


3) ROVER reduces hallucinations during videos exhibiting non-expert behavior

ROVER shows significantly higher accuracy than GVL and TemporalConcat on the video QA benchmark across all task groups. The video QA results reveal a hallucination problem with GVL and TemporalConcat: the VLM is likely to state that an event occurred during a video regardless of whether it actually did. This is illustrated by the low precision (20 to 50%) and high recall (near 100%) across all task groups for GVL and TemporalConcat. The hallucination problem is also highlighted by the analysis of the distance between the time when something occurs in a video and the time when the VLM states that it occurred. We see that, even when GVL correctly states that something occurs during a video, it is much more likely than ROVER to state this prematurely (as shown by the negative frame difference in the figure below).
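To make these metrics concrete, the sketch below shows one way to compute precision, recall, and the signed frame difference from per-event annotations; the field names are illustrative rather than the benchmark's actual schema:

def qa_metrics(events):
    # Each event dict has: 'occurred' (bool, event truly happens), 'predicted'
    # (bool, VLM says it happens), 'true_frame' and 'predicted_frame' (frame indices).
    tp = sum(e['occurred'] and e['predicted'] for e in events)
    fp = sum((not e['occurred']) and e['predicted'] for e in events)
    fn = sum(e['occurred'] and (not e['predicted']) for e in events)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # Negative values mean the model claims the event before it actually occurs.
    frame_diffs = [e['predicted_frame'] - e['true_frame']
                   for e in events if e['occurred'] and e['predicted']]
    return precision, recall, frame_diffs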

BibTeX

@misc{schroeder2025rover,
    author = {Schroeder, Philip and Biza, Ondrej and Weng, Thomas and Luo, Hongyin and Glass, James},
    title  = {ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks},
    year   = {2025},
    note   = {Preprint}
}