Ego4D Dataset: Advancing Multimodal Perception of Egocentric Video

  • K. Grauman et al., “Ego4D: Around the World in 3,000 Hours of Egocentric Video,” Oct. 2021, Accessed: Jan. 16, 2022. [Online]. Available:

Quick Facts

Goal: large-scale “in the wild” egocentric video dataset


  • 3025 hours from 74 cities & 9 countries
  • 855 camera wearers
  • Multi-modal: audio, 3D scans, IMU, stereo sound, multi-camera…

Motivation for Egocentric Video

  • Applications in AR and VR
    • Fun fact: this dataset is mainly built by FAIR
  • Robot Learning


Episodic Memory

Where is my X?

Egocentric video gives a recording of a wearer’s daily life, and can be used to augment human memory on demand. Such a system might remind a user where they left their keys, tell them whether they added salt to a recipe, or help them recall events they attended.

Querying Memory

This benchmark has three tasks, distinguished by the type of input used to query the memory: a visual query (e.g. find the keys given an image of them), a textual query (“How many cups of sugar did I add?”), and a moment query (find all instances of an activity, e.g. “When did I play with the dog?”).
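
The three query types can be sketched as a small set of Python types. This is only an illustration of the distinction drawn above, not the dataset's actual annotation schema: the class names, field names and the `expected_response` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class VisualQuery:
    """Query by example image, e.g. a crop showing the keys."""
    image_path: str  # hypothetical field name


@dataclass
class LanguageQuery:
    """Free-form question about the recorded past."""
    question: str


@dataclass
class MomentQuery:
    """Named activity; every matching instance should be returned."""
    activity: str


Query = Union[VisualQuery, LanguageQuery, MomentQuery]


def expected_response(query: Query) -> str:
    """Describe in words what a memory system would return for a query."""
    if isinstance(query, VisualQuery):
        return "where the queried object was last seen"
    if isinstance(query, LanguageQuery):
        return "the video segment that answers the question"
    if isinstance(query, MomentQuery):
        return "all time windows where the activity occurs"
    raise TypeError(f"unsupported query type: {type(query).__name__}")
```

For example, `expected_response(MomentQuery("play with the dog"))` describes a set of time windows rather than a single location, which is what separates the moment task from the other two.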

Construction of Queries

For the language queries, a set of templates was designed, which annotators used to write questions for the task. Examples include “What is the state of object X?” and “Where is object X after event Y?”. The questions were then rewritten for variety.
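
A minimal sketch of template-based query writing, assuming templates are plain strings with named slots. Only the two template wordings come from the text above; the dictionary keys and the `fill_template` helper are hypothetical, and the paraphrasing step is done by annotators, not by code.

```python
# Two of the templates quoted above, with X and Y turned into named slots.
TEMPLATES = {
    "object_state": "What is the state of {x}?",
    "object_after_event": "Where is {x} after {y}?",
}


def fill_template(name: str, **slots: str) -> str:
    """Fill a query template with concrete object/event descriptions."""
    return TEMPLATES[name].format(**slots)


if __name__ == "__main__":
    print(fill_template("object_state", x="the oven"))
    print(fill_template("object_after_event", x="the knife", y="I chopped the onion"))
```

An annotator would then reword the filled template (for instance, "Is the oven on?") to add the variety mentioned above.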

Recalling Lives

Given the broad nature of this benchmark, the task does not focus on any particular subset of activities, which makes it realistic and challenging.

Human Object Interaction

Hand + Object Interaction

How do objects change during interactions? Going beyond action recognition, this benchmark tracks when, where and how an object changes during an interaction, which is only possible from a first-person viewpoint.

Change of State

We capture annotations of objects as they transform temporally, spatially and semantically; an onion, for example, might be minced. These are represented by three different tasks in the benchmark: Point-of-no-return Temporal Localisation, Active Object Detection and State-Change Classification.

Pre/Post Conditions

Each annotation is labelled with the prior state (the precondition), the posterior state, and the point of no return (PNR) at which the state change is triggered.
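
One way to picture such an annotation is as three ordered frame indices. The sketch below is an assumption made for illustration, not the dataset's file format: the `StateChange` class, its field names and the `phase` method are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StateChange:
    """Hypothetical record for one annotated object state change."""
    object_name: str
    pre_frame: int   # frame showing the prior state
    pnr_frame: int   # point of no return: the change is triggered here
    post_frame: int  # frame showing the posterior state

    def __post_init__(self) -> None:
        # The three frames must be in temporal order.
        if not (self.pre_frame <= self.pnr_frame <= self.post_frame):
            raise ValueError("expected pre_frame <= pnr_frame <= post_frame")

    def phase(self, frame: int) -> str:
        """Say whether a frame lies before, at, or after the PNR."""
        if frame < self.pnr_frame:
            return "pre"
        if frame == self.pnr_frame:
            return "pnr"
        return "post"


if __name__ == "__main__":
    mince = StateChange("onion", pre_frame=10, pnr_frame=42, post_frame=80)
    print(mince.phase(20), mince.phase(42), mince.phase(60))
```

Under this view, PNR temporal localisation amounts to predicting `pnr_frame`, while state-change classification asks whether a clip contains such a transition at all.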

World of Interactions

The data for this challenge has been selected from activities with a high level of hand-object interactions such as knitting, carpentry, and baking.

Audio-Visual Diarization

Who said what, and when?

Conversations are egocentric in nature, and a human-in-the-loop AI requires skills such as localizing a speaker and transcribing speech content.

Looking for Conversation

This benchmark contains 2 different tasks focused on visual data: localizing and tracking of the speakers in the visual field of view. Note that identities are anonymized to match consortium guidelines.

Hearing the Words

The benchmark also includes 2 tasks for the audio modality: diarization (the temporal extent of the sentences spoken) and transcription of the conversation.
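
The combined output of diarization and transcription ("who said what, and when") can be pictured as a list of timed, speaker-labelled utterances. This is a minimal sketch under that assumption; the `Utterance` class, the anonymized speaker labels and both helper functions are hypothetical rather than part of the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str   # anonymized label such as "person_1"
    start: float   # seconds from the start of the clip
    end: float     # seconds from the start of the clip
    text: str      # transcribed speech


def speakers_at(utterances: List[Utterance], t: float) -> List[str]:
    """Return the speakers active at time t (overlapping speech is allowed)."""
    return sorted({u.speaker for u in utterances if u.start <= t < u.end})


def transcript(utterances: List[Utterance]) -> str:
    """Render the utterances in time order, one per line."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return "\n".join(
        f"[{u.start:.1f}-{u.end:.1f}] {u.speaker}: {u.text}" for u in ordered
    )


if __name__ == "__main__":
    talk = [
        Utterance("person_2", 1.5, 3.0, "Here you go."),
        Utterance("person_1", 0.0, 2.0, "Pass the salt."),
    ]
    print(speakers_at(talk, 1.7))
    print(transcript(talk))
```

Diarization fills in `speaker`, `start` and `end`; transcription fills in `text`.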

Much Ado About Talking

With this task focused on conversations, scenarios were chosen which included multiple participants interacting together, such as eating, playing games or setting up tents.


Social Interactions

Who is attending to whom?

An egocentric video provides a unique lens for studying social interactions because it captures utterances and nonverbal cues from each participant’s unique view and enables embodied approaches to social understanding.

More than Conversation

The Social benchmark extends the Audio-Visual Diarization benchmark towards understanding the conversations of a social group over a longer period of time for specific tasks.

Talking and Listening

This benchmark includes two different tasks focused on when a person is Looking at Me and when a person is Talking to Me.

Unique Interactions

The data for the Social Interaction task was collected specifically with this task in mind, using multi-user scenarios such as social deduction games, eating/drinking and playing basketball.

Forecasting



Predicting the future is a critical skill for AI systems to provide timely assistance for users. With a myriad of long-form, unscripted videos, Ego4D provides an interesting challenge for different forecasting tasks.

Where Will I Move

Two tasks consider the future motion of the user's body and hands. Models should predict where the camera wearer will go within the scene and the future location of the wearer’s hands.

What Will Happen Next?

Two tasks consider short-term and long-term future anticipation. In the short term, algorithms should predict the next object interaction that will take place, together with a countdown to it. In the long term, they should predict the possible sequence of actions that will follow.
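
The two anticipation outputs could be represented roughly as follows. This is a sketch under stated assumptions, not the benchmark's submission format: the `ShortTermPrediction` class, its field names, the `ActionSequence` alias and the `most_likely` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ShortTermPrediction:
    """One candidate for the next object interaction."""
    verb: str                 # e.g. "cut"
    noun: str                 # e.g. "onion"
    time_to_contact_s: float  # countdown until the interaction starts
    score: float              # model confidence, higher is more likely


# Long-term anticipation: an ordered sequence of future (verb, noun) actions.
ActionSequence = List[Tuple[str, str]]


def most_likely(predictions: List[ShortTermPrediction]) -> ShortTermPrediction:
    """Pick the candidate with the highest confidence score."""
    if not predictions:
        raise ValueError("no predictions given")
    return max(predictions, key=lambda p: p.score)


if __name__ == "__main__":
    candidates = [
        ShortTermPrediction("take", "knife", 1.2, 0.3),
        ShortTermPrediction("cut", "onion", 0.8, 0.6),
    ]
    best = most_likely(candidates)
    print(best.verb, best.noun, best.time_to_contact_s)
```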

Data for Prophets

The data for this challenge has been selected from a diverse set of activities containing many human-object interactions and movements such as brick making, cooking or carpentry.