Something-Else: Compositional Action Recognition

  • J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, “Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks,” arXiv:1912.09930 [cs], Sep. 2020, Accessed: Jan. 26, 2022. [Online]. Available:


Do current ML algorithms have the capability to generalize across different combinations of verbs and nouns (compositionality)?

It turns out that current methods such as I3D, which use scene-level convolutional operators, rely heavily on spatial appearance rather than temporal transformations or geometric relations, and therefore fail at this task.

This paper introduces a model that relies on a sparse and semantically-rich object graph learnt for each action, learning explicit relations between subjects and objects.

It also introduces the Something-Else dataset and a compositional action recognition task, in which the training and test data are split so as to test the model's ability to generalize compositionally.


Improving Token-Mixer

The spatial interaction module looks just like a token mixer. Perhaps replacing it with self-attention would improve performance?
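To make the idea concrete, here is a minimal NumPy sketch comparing the two mixers on the N object features of a single frame. The uniform mixer averages the other N-1 objects, as the spatial interaction module does; the attention mixer is a plain single-head scaled dot-product self-attention. The function names, weight matrices, and sizes are assumptions for illustration and are not taken from the paper. Note one difference: standard self-attention also attends to the object itself, whereas the mean excludes it.

```python
import numpy as np

def mean_mixer(X):
    """Uniform mixing: for each object, the mean of the other N-1 objects.

    X has shape (N, d); the result has shape (N, d).
    """
    return (X.sum(axis=0, keepdims=True) - X) / (X.shape[0] - 1)

def attention_mixer(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the N objects.

    Hypothetical replacement for the uniform mean; each object receives a
    learned, input-dependent weighting of all objects (itself included).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (N, d)

rng = np.random.default_rng(0)
N, d = 4, 8  # hypothetical sizes
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

mixed_uniform = mean_mixer(X)
mixed_attention = attention_mixer(X, W_q, W_k, W_v)
```

Whether this actually improves accuracy is an open question in these notes; the sketch only shows that the swap is a drop-in change in terms of shapes.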


Spatial-Temporal Interaction Networks

Figure 1: Architecture Overview


STIN = Object Detector + Tracker + Reasoning

By modeling the transformation of geometric relations between objects in a video, STIN can generalize well to unseen compositions.

Object-centric Representation

A video with T frames => Object detection: hands and generic candidate constituent objects.

Two types of feature representation are extracted for each bounding box:

  1. Bounding box coordinates:
    • center coordinate + width + height => MLP => d-dimensional feature
  2. Object identity: another d-dimensional embedding that represents the identities of objects and subjects
    1. Subject embedding (agent): representing hands in an action
    2. Object embedding: representing objects involved in the action
    3. Null embedding: representing dummy boxes irrelevant to the action
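The two feature types above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the two-layer MLP, the random weights standing in for learned parameters, the value of d, and the choice to concatenate the coordinate feature with the identity embedding are all hypothetical and not confirmed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension (hypothetical value)

# Identity embedding table with one d-dimensional row per identity type.
NULL, SUBJECT, OBJECT = 0, 1, 2
identity_table = rng.normal(size=(3, d))

# Random stand-ins for the learned weights of a two-layer MLP on box coordinates.
W1, b1 = rng.normal(size=(4, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)

def box_feature(boxes):
    """Map boxes of shape (..., 4) = (cx, cy, w, h) to features of shape (..., d)."""
    hidden = np.maximum(boxes @ W1 + b1, 0.0)  # ReLU
    return hidden @ W2 + b2

def object_representation(boxes, identities):
    """Combine the coordinate feature with the identity embedding.

    Concatenation is an assumption made for this sketch; output is (..., 2d).
    """
    return np.concatenate([box_feature(boxes), identity_table[identities]], axis=-1)

T, N = 4, 3  # frames and boxes per frame (hypothetical sizes)
boxes = rng.uniform(size=(T, N, 4))
identities = np.tile(np.array([SUBJECT, OBJECT, NULL]), (T, 1))  # shape (T, N)
x = object_representation(boxes, identities)  # shape (T, N, 2d)
```

Here every frame holds one hand, one object, and one dummy box; the dummy box still gets coordinates, but its identity slot carries the null embedding.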

Spatial-temporal Interaction Reasoning

Given \(T\) video frames and \(N\) objects per frame, we denote the set of object features as \(X=\) \(\left(x_{1}^{1}, \ldots, x_{N}^{1}, x_{1}^{2}, \ldots, x_{N}^{2}, \ldots, x_{N}^{T}\right)\), where \(x_{i}^{t}\) represents the feature of object \(i\) in frame \(t\).

  • Spatial Interaction Module

    Performs spatial interaction reasoning among the N objects in each frame.

    \[ f\left(x_{i}^{t}\right)=\operatorname{ReLU}\left(W_{f}^{T}\left[x_{i}^{t}, \frac{1}{N-1} \sum_{j \neq i} x_{j}^{t}\right]\right) \]

    where [,] denotes concatenation of two features along the channel dimension and \(W_{f}^{T}\) are learnable weights implemented by a fully connected layer.

  • Temporal Interaction Module

    Given these aggregated object features in each frame, temporal reasoning is performed on top of them.

    \[ p(X)=W_{p}^{T} h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right) \]

    where \(g\) encodes the tracklet of object \(i\) across the \(T\) frames and \(h\) is a function that combines and aggregates the information from the tracklets.

    In this study, two different approaches are proposed to combine tracklets:

    1. Design \(h\) as a simple averaging function to demonstrate the effectiveness of the spatial-temporal interaction reasoning.
    2. Use a non-local block as the function \(h\). The non-local block encodes the pairwise relationships between every two trajectory features before averaging them.
  • Combining Video Appearance Representation

    A 3D conv backbone transforms the video into a d-dimensional feature. This appearance representation is especially helpful when the action has no prominent inter-object dynamics.

    Video appearance representations are concatenated with the object representations \(h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right)\) before being fed into the classifier.
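The three steps above (spatial interaction, temporal interaction with averaging as \(h\), and concatenation with the appearance feature) can be sketched end to end in NumPy. This is a minimal sketch under assumptions, not the paper's implementation: \(g\) is taken to be a single fully connected layer with ReLU over the time-concatenated tracklet, all weights are random stand-ins for learned parameters, the appearance vector is a placeholder for the 3D conv backbone output, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, d, num_classes = 4, 3, 8, 5  # hypothetical sizes

def spatial_interaction(X, W_f):
    """f(x_i^t) = ReLU(W_f^T [x_i^t, mean of the other N-1 objects in frame t]).

    X: (T, N, d), W_f: (2d, d). Returns (T, N, d).
    """
    total = X.sum(axis=1, keepdims=True)              # (T, 1, d)
    others_mean = (total - X) / (X.shape[1] - 1)      # (T, N, d)
    pair = np.concatenate([X, others_mean], axis=-1)  # (T, N, 2d)
    return np.maximum(pair @ W_f, 0.0)

def temporal_interaction(F, W_g):
    """g: concatenate each tracklet over time, then FC + ReLU (assumed form).
    h: simple average over the N tracklets.

    F: (T, N, d), W_g: (T*d, d). Returns (d,).
    """
    n_frames, n_objects, dim = F.shape
    tracklets = F.transpose(1, 0, 2).reshape(n_objects, n_frames * dim)
    g_out = np.maximum(tracklets @ W_g, 0.0)          # (N, d)
    return g_out.mean(axis=0)

# Random stand-ins for learned weights.
W_f = rng.normal(size=(2 * d, d))
W_g = rng.normal(size=(T * d, d))
W_p = rng.normal(size=(2 * d, num_classes))

X = rng.normal(size=(T, N, d))    # object features x_i^t
appearance = rng.normal(size=d)   # placeholder for the 3D conv backbone feature

F = spatial_interaction(X, W_f)
object_feature = temporal_interaction(F, W_g)
logits = np.concatenate([appearance, object_feature]) @ W_p  # (num_classes,)
```

Replacing the `mean` in `temporal_interaction` with a non-local block over the N tracklet features would give the second variant of \(h\) described above.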