- K. Grauman et al., “Ego4D: Around the World in 3,000 Hours of Egocentric Video,” Oct. 2021, Accessed: Jan. 16, 2022. [Online]. Available: https://arxiv.org/abs/2110.07058v1
Goal: large-scale “in the wild” egocentric video dataset
- 3,025 hours from 74 cities & 9 countries
- 855 camera wearers
- Multi-modal: audio, 3D scans, IMU, stereo sound, multi-camera…
Motivation for Egocentric Video
- Application in AR, VR
- Fun fact: this dataset is mainly built by FAIR
- Robot Learning
Where is my X?
Egocentric video gives a recording of a wearer’s daily life, and can be used to augment human memory on demand. Such a system might be able to remind a user where they left their keys, whether they added salt to a recipe, or recall events they attended.
There are three different tasks within this benchmark based on the input type used to query the memory: visual query (e.g. find the location of the keys given an image of them), textual query (“How many cups of sugar did I add?”), and moment query (find all instances of “When did I play with the dog?”).
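The three query types could be represented as a simple tagged structure, each mapping to a different kind of answer localisation. This is a hypothetical sketch for illustration, not the official Ego4D API; all field and function names are made up.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical representation of the three episodic-memory query types;
# illustrative only, not the released Ego4D schema.
@dataclass
class MemoryQuery:
    kind: str                         # "visual", "textual", or "moment"
    text: Optional[str] = None        # natural-language question, if any
    image_path: Optional[str] = None  # crop of the query object, if any

visual_q = MemoryQuery(kind="visual", image_path="crops/keys.jpg")
textual_q = MemoryQuery(kind="textual", text="How many cups of sugar did I add?")
moment_q = MemoryQuery(kind="moment", text="When did I play with the dog?")

def expected_answer_type(q: MemoryQuery) -> str:
    """Each query type is answered with a different kind of localisation."""
    return {
        "visual": "bounding box + response track",
        "textual": "temporal window containing the answer",
        "moment": "all matching temporal segments",
    }[q.kind]
```

The point of the split is that a visual query is answered spatially (where the object last appeared), while textual and moment queries are answered temporally.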
Construction of Queries
For the language queries, a set of templates was designed which annotators used to write questions for the task. Examples include “What is the state of object X?” or “Where is object X after event Y?”. These were then re-written for variety.
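Template-based generation of this kind can be sketched with simple slot filling; the template strings and slot names below are illustrative, not the actual annotation tooling used by Ego4D.

```python
# A minimal sketch of template-based query generation. The templates and
# slot names are assumptions for illustration, not the real pipeline.
TEMPLATES = [
    "What is the state of {object}?",
    "Where is {object} after {event}?",
    "How many {object} did I {verb}?",
]

def fill(template: str, **slots: str) -> str:
    """Instantiate a question template with concrete slot values."""
    return template.format(**slots)

q1 = fill(TEMPLATES[1], object="the knife", event="dinner")
q2 = fill(TEMPLATES[2], object="cups of sugar", verb="add")
# Annotators would then paraphrase such questions for linguistic variety.
```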
Given the broad nature of this benchmark, there isn’t a subset of activities that were focused on within this task, leading to a realistic and challenging benchmark.
Human Object Interaction
Hand + Object Interaction
How do objects change during interactions? Going beyond action recognition, this benchmark follows when, where, and how an object is changed during an interaction - only possible through a first-person viewpoint.
Change of State
We capture annotations of objects as they transform temporally, spatially, and semantically (an onion, for example, might be minced). These are represented by three different tasks in the benchmark: Point-of-no-return Temporal Localisation, Active Object Detection, and State-Change Classification.
Each annotation is labelled with the prior state (the condition before the change) and the posterior state, as well as the point of no return (PNR) at which the state change is triggered.
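An annotation of this shape can be sketched as a small record holding the clip extent, the PNR timestamp, and the prior/posterior states. The field names here are assumptions for illustration, not the released Ego4D annotation schema.

```python
from dataclasses import dataclass

# Hypothetical state-change annotation record; field names are illustrative.
@dataclass
class StateChangeAnnotation:
    clip_start_s: float
    clip_end_s: float
    pnr_s: float          # point of no return: when the change is triggered
    prior_state: str      # condition before the interaction
    posterior_state: str  # condition after the interaction
    active_object: str    # the object undergoing the change

    def pnr_is_valid(self) -> bool:
        """The PNR must fall inside the annotated clip."""
        return self.clip_start_s <= self.pnr_s <= self.clip_end_s

ann = StateChangeAnnotation(
    clip_start_s=12.0, clip_end_s=20.0, pnr_s=15.4,
    prior_state="whole onion", posterior_state="minced onion",
    active_object="onion",
)
```

The three tasks then map naturally onto this record: localising `pnr_s`, detecting `active_object`, and classifying whether a state change occurred at all.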
World of Interactions
The data for this challenge has been selected from activities with a high level of hand-object interactions such as knitting, carpentry, and baking.
Who said what, and when?
Conversations are egocentric in nature, and a human-in-the-loop AI requires skills such as localizing a speaker and transcribing speech content.
Looking for Conversation
This benchmark contains two tasks focused on visual data: localizing and tracking the speakers in the visual field of view. Note that identities are anonymized to match consortium guidelines.
Hearing the Words
The benchmark also includes two tasks for the audio modality: diarization (the temporal extent of each spoken sentence, per speaker) and transcription of the conversation.
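Diarization output is essentially a list of (speaker, start, end) segments; a downstream consumer might aggregate them into per-speaker talk time. The segment format below is an illustrative assumption, not the benchmark's exact schema.

```python
from collections import defaultdict

# Hypothetical diarization output: (speaker_id, start_s, end_s) segments.
segments = [
    ("wearer", 0.0, 4.2),
    ("speaker_1", 3.8, 9.0),   # overlapping speech is allowed
    ("wearer", 9.5, 12.0),
]

def talk_time(segs):
    """Total speaking time per speaker (overlaps counted for each speaker)."""
    totals = defaultdict(float)
    for speaker, start, end in segs:
        totals[speaker] += end - start
    return dict(totals)

# talk_time(segments)["wearer"] is roughly 6.7 seconds (4.2 + 2.5)
```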
Much Ado About Talking
With this task focused on conversations, scenarios were chosen which included multiple participants interacting together, such as eating, playing games or setting up tents.
Who is attending to whom?
An egocentric video provides a unique lens for studying social interactions because it captures utterances and nonverbal cues from each participant’s unique view and enables embodied approaches to social understanding.
More than Conversation
Social extends the Audio-Visual Diarization benchmark towards understanding the conversations of a social group over a longer period of time for specific tasks.
Talking and Listening
This benchmark includes two different tasks focused on when a person is Looking at Me and when a person is Talking to Me.
The data within the Social Interaction task was collected with this task specifically in mind, using multi-user scenarios such as social deduction games, eating/drinking, and playing basketball.
Forecasting
Predicting the future is a critical skill for AI systems to provide timely assistance to users. With a myriad of long-form, unscripted videos, Ego4D provides an interesting challenge for different forecasting tasks.
Where Will I Move?
Two tasks consider the future motion of the wearer’s hands and feet. Models should predict where the camera wearer will go within the scene and the future locations of the wearer’s hands.
What Will Happen Next?
Two tasks consider short- and long-term anticipation. In the short term, algorithms should predict the next object interaction that will take place and a countdown until it happens; in the long term, they should predict the next possible sequence of actions.
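The short-term “countdown” label can be sketched as a simple derivation: given the current frame time and the annotated timestamp of the next contact, the label is their difference. This is a hedged illustration under assumed names, not how Ego4D's label generation is actually implemented.

```python
# Hypothetical derivation of a time-to-contact label for short-term
# anticipation; function and parameter names are assumptions.
def time_to_contact(current_s: float, next_contact_s: float) -> float:
    """Seconds until the next annotated hand-object contact; 0 if past."""
    return max(0.0, next_contact_s - current_s)

# At t = 10.0 s, with the next interaction annotated at t = 12.5 s,
# the countdown label is 2.5 seconds.
label = time_to_contact(10.0, 12.5)
```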
Data for Prophets
The data for this challenge has been selected from a diverse set of activities containing many human-object interactions and movements such as brick making, cooking or carpentry.