There are 142 example videos across 27 tasks. The videos are grouped into levels according to how much of the task is completed in the video. The amount of non-expert behavior increases as the level decreases. The highest-level trajectories for each task show full task completion with near-expert behavior. The level 1 trajectories do not achieve any part of the task.
The task progress values generated by ROVER and GVL are shown in yellow and blue, respectively. The ground-truth progress values are shown in gray.
The performance metrics for task progress prediction include the Pearson correlation coefficient (higher is better) and the L2 distance (lower is better), each computed between the ground-truth progress values and the progress values generated by a method (ROVER or GVL). These metrics are shown as "Corr." and "Dist." on the right in each video.
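As an illustration of how the two metrics relate a predicted progress curve to the ground truth, here is a minimal sketch for a single video. The function name `progress_metrics` is hypothetical, and the sketch assumes one progress value per frame and an unnormalized Euclidean (L2) distance; the source does not state whether the distance is normalized by video length, so the values shown in the videos may be scaled differently.

```python
import numpy as np


def progress_metrics(ground_truth, predicted):
    """Return (Pearson correlation, L2 distance) for one video.

    Assumptions (not confirmed by the source): both inputs hold one
    progress value per frame, and the distance is the plain Euclidean
    norm of the per-frame difference, without length normalization.
    """
    gt = np.asarray(ground_truth, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    if gt.ndim != 1 or gt.shape != pred.shape:
        raise ValueError("inputs must be 1-D sequences of equal length")

    # Pearson correlation is the off-diagonal entry of the 2x2
    # correlation matrix. It is undefined (NaN) when either sequence
    # is constant, e.g. a flat ground-truth curve.
    corr = float(np.corrcoef(gt, pred)[0, 1])

    # L2 distance: Euclidean norm of the per-frame error.
    dist = float(np.linalg.norm(gt - pred))
    return corr, dist
```

The correlation measures whether the predicted curve rises and falls with the ground truth regardless of scale, while the distance penalizes absolute error, so a method can score well on one and poorly on the other.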