# Something-Else: Compositional Action Recognition

• J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, “Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks,” arXiv:1912.09930 [cs], Sep. 2020, Accessed: Jan. 26, 2022. [Online]. Available: http://arxiv.org/abs/1912.09930

## Abstract

Can current ML algorithms generalize across different combinations of verbs and nouns (compositionality)?

It turns out that current methods such as I3D, which use scene-level convolutional operators, rely heavily on spatial appearance rather than temporal transformations or geometric relations, and therefore fail at this task.

This paper introduces a model that relies on a sparse, semantically rich object graph learned for each action, modeling explicit relations between subjects and objects.

It also introduces the Something-Else dataset and a compositional action recognition task in which training and testing data are split so that the verb-noun combinations seen at test time do not appear during training, directly probing the model's compositional generalization.

## Note

### Improving Token-Mixer

The spatial interaction module looks just like a token mixer. Could replacing it with self-attention improve performance?

## Approach

### Spatial-Temporal Interaction Networks

STIN = Object Detector + Tracker + Reasoning

By modeling how the geometric relations between objects transform over the course of a video, STIN can generalize well to unseen compositions.

#### Object-centric Representation

A video with T frames => object detection: hands plus generic candidate constituent objects.

Two types of feature representation are extracted for each bounding box (a minimal sketch follows this list):

1. Bounding box coordinates:
   • center coordinates + width + height => MLP => d-dimensional feature
2. Object identity: another d-dimensional embedding representing the identities of objects and subjects
   1. Subject embedding (agent): representing hands in an action
   2. Object embedding: representing objects involved in the action
   3. Null embedding: representing dummy boxes irrelevant to the action
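
Below is a minimal PyTorch-style sketch of this object-centric representation. The layer sizes, the identity vocabulary indices, and the choice to sum (rather than concatenate) the coordinate feature and the identity embedding are assumptions for illustration, not the paper's exact implementation.

```python
import torch.nn as nn

class ObjectCentricEmbedding(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        # MLP mapping (cx, cy, w, h) box coordinates to a d-dimensional feature
        self.coord_mlp = nn.Sequential(
            nn.Linear(4, d), nn.ReLU(), nn.Linear(d, d)
        )
        # Identity embeddings: 0 = subject (hand), 1 = object, 2 = null (dummy box)
        self.identity_emb = nn.Embedding(3, d)

    def forward(self, boxes, identities):
        # boxes: (B, T, N, 4) box coordinates; identities: (B, T, N) integer ids
        # NOTE: summing the two features is an assumption; concatenation would also fit
        # the description above.
        return self.coord_mlp(boxes) + self.identity_emb(identities)
```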

#### Spatial-temporal Interaction Reasoning

Given $$T$$ video frames and $$N$$ objects per frame, we denote the set of object features as $$X=$$ $$\left(x_{1}^{1}, \ldots, x_{N}^{1}, x_{1}^{2}, \ldots, x_{N}^{2}, \ldots, x_{N}^{T}\right)$$, where $$x_{i}^{t}$$ represents the feature of object $$i$$ in frame $$t$$.

• Spatial Interaction Module

Performs spatial interaction reasoning among the N objects in each frame.

$f\left(x_{i}^{t}\right)=\operatorname{ReLU}\left(W_{f}^{T}\left[x_{i}^{t}, \frac{1}{N-1} \sum_{j \neq i} x_{j}^{t}\right]\right)$

where $$[\cdot,\cdot]$$ denotes concatenation of two features along the channel dimension and $$W_{f}$$ is a learnable weight matrix implemented as a fully connected layer.
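
A minimal PyTorch-style sketch of the spatial interaction module following the formula above; the layer size and the `SpatialInteraction` name are illustrative.

```python
import torch
import torch.nn as nn

class SpatialInteraction(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.fc = nn.Linear(2 * d, d)  # W_f, applied to the concatenated features

    def forward(self, x):
        # x: (B, T, N, d) per-frame object features
        N = x.size(2)
        # mean of the other N-1 objects in the same frame: (sum - x_i) / (N - 1)
        others = (x.sum(dim=2, keepdim=True) - x) / (N - 1)
        return torch.relu(self.fc(torch.cat([x, others], dim=-1)))
```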

• Temporal Interaction Module

Given the aggregated object features in each frame, the module performs temporal reasoning on top of them.

$p(X)=W_{p}^{T} h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right)$

where $$g$$ aggregates the features of object $$i$$ across the $$T$$ frames into a tracklet feature and $$h$$ is a function combining and aggregating the information of the $$N$$ tracklets.

In this study, two different approaches are proposed to combine tracklets (a sketch of the averaging variant follows this list):

1. Design $$h$$ as a simple averaging function to demonstrate the effectiveness of the spatial-temporal interaction reasoning.
2. Utilize a non-local block as the function $$h$$. The non-local block encodes the pairwise relationships between every two trajectory features before averaging them.
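
A minimal PyTorch-style sketch of the temporal interaction module with the simple averaging variant of $$h$$; the exact form of $$g$$ (here: concatenating a tracklet's frame features and applying an MLP), the layer sizes, and the class count are assumptions for illustration.

```python
import torch.nn as nn

class TemporalInteraction(nn.Module):
    def __init__(self, d=256, num_frames=8, num_classes=174):
        super().__init__()
        # g: fuse a tracklet's T per-frame features into a single d-dim feature
        self.g = nn.Sequential(
            nn.Linear(num_frames * d, d), nn.ReLU(), nn.Linear(d, d)
        )
        self.classifier = nn.Linear(d, num_classes)  # W_p

    def forward(self, x):
        # x: (B, T, N, d) object features after spatial interaction reasoning
        B, T, N, d = x.shape
        tracklets = self.g(x.permute(0, 2, 1, 3).reshape(B, N, T * d))  # (B, N, d)
        video_feat = tracklets.mean(dim=1)   # h: simple average over the N tracklets
        return self.classifier(video_feat)   # p(X): action logits
```
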
• Combining Video Appearance Representation

A 3D conv backbone transforms the video into a d-dimensional appearance feature. This appearance representation is especially helpful when the action has no prominent inter-object dynamics.

The video appearance representation is concatenated with the object representation $$h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right)$$ before being fed into the classifier.
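
A minimal sketch of this late fusion, assuming PyTorch; `backbone_3d` stands in for any 3D conv network producing a d-dimensional clip feature, and the classifier shape is an assumption.

```python
import torch
import torch.nn as nn

class STINWithAppearance(nn.Module):
    def __init__(self, backbone_3d, d=256, num_classes=174):
        super().__init__()
        self.backbone_3d = backbone_3d            # any 3D conv network returning (B, d)
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, frames, object_feat):
        # frames: (B, C, T, H, W) raw clip; object_feat: (B, d) = h({g(...)}) from above
        appearance = self.backbone_3d(frames)     # (B, d) global appearance feature
        fused = torch.cat([object_feat, appearance], dim=-1)
        return self.classifier(fused)
```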