Something-Else: Compositional Action Recognition
- J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, “Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks,” arXiv:1912.09930 [cs], Sep. 2020, Accessed: Jan. 26, 2022. [Online]. Available: http://arxiv.org/abs/1912.09930
Abstract
Do current ML algorithms have the capability to generalize across different combinations of verbs and nouns (compositionality)?
It turns out that current methods such as I3D, built on scene-level convolutional operators, rely heavily on spatial appearance rather than temporal transformations or geometric relations, and thus fail at this task.
This paper introduces a model that relies on a sparse, semantically rich object graph learned for each action, modeling explicit relations between subjects and objects.
It also introduces the Something-Else dataset and a compositional action recognition task, where training and testing data are split so as to test the model's ability to generalize compositionally.
Note
Improving Token-Mixer
The spatial interaction module looks just like a token mixer. Could replacing it with self-attention improve performance?
Approach
Spatial-Temporal Interaction Networks

Figure 1: Architecture Overview
STIN = Object Detector + Tracker + Reasoning
By modeling the transformation of object geometric relations in a video, STIN can generalize well to unseen compositions.
Object-centric Representation
A video with T frames => object detection: hands and generic candidate constituent objects.
Two types of feature representation are extracted for each bounding box (see the sketch after this list):
- Bounding box coordinates:
- center coordinates + width + height => MLP => d-dimensional feature
- Object identity: another d-dimensional embedding representing the identities of objects and subjects
- Subject embedding (agent): representing hands in an action
- Object embedding: representing objects involved in the action
- Null embedding: representing dummy boxes irrelevant to the action
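A minimal sketch of this object-centric representation, assuming (cx, cy, w, h) box coordinates and three identity classes (subject/hand, object, null); how the coordinate feature and identity embedding are combined (here, summed) is an assumption, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    def __init__(self, d=256, num_identities=3):
        super().__init__()
        # MLP over (cx, cy, w, h) box coordinates -> d-dimensional feature
        self.coord_mlp = nn.Sequential(
            nn.Linear(4, d), nn.ReLU(),
            nn.Linear(d, d),
        )
        # Identity embeddings: 0 = subject (hand), 1 = object, 2 = null/dummy box
        self.identity_emb = nn.Embedding(num_identities, d)

    def forward(self, boxes, identities):
        # boxes: (T, N, 4) as (cx, cy, w, h); identities: (T, N) integer ids
        # combining by addition is an assumption made for this sketch
        return self.coord_mlp(boxes) + self.identity_emb(identities)
```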
Spatial-temporal Interaction Reasoning
Given \(T\) video frames and \(N\) objects per frame, we denote the set of object features as \(X=\) \(\left(x_{1}^{1}, \ldots, x_{N}^{1}, x_{1}^{2}, \ldots, x_{N}^{2}, \ldots, x_{N}^{T}\right)\), where \(x_{i}^{t}\) represents the feature of object \(i\) in frame \(t\).
Spatial Interaction Module
Performing spatial interaction reasoning among the N objects in each frame:
\[ f\left(x_{i}^{t}\right)=\operatorname{ReLU}\left(W_{f}^{T}\left[x_{i}^{t}, \frac{1}{N-1} \sum_{j \neq i} x_{j}^{t}\right]\right) \]
where \([\cdot,\cdot]\) denotes concatenation of two features along the channel dimension and \(W_{f}^{T}\) is a learnable weight matrix implemented as a fully connected layer.
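A minimal sketch of this spatial interaction module, assuming per-frame object features of shape (T, N, d); the layer name and feature dimension are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialInteraction(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        # W_f: fully connected layer over the concatenated 2d-dimensional feature
        self.fc = nn.Linear(2 * d, d)

    def forward(self, x):
        # x: (T, N, d) object features per frame
        T, N, d = x.shape
        # mean over the other N-1 objects: (sum - x_i) / (N - 1)
        others = (x.sum(dim=1, keepdim=True) - x) / (N - 1)
        # f(x_i^t) = ReLU(W_f^T [x_i^t, mean_{j != i} x_j^t])
        return F.relu(self.fc(torch.cat([x, others], dim=-1)))
```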
Temporal Interaction Module
Given these aggregated per-frame object features, the module performs temporal reasoning on top of them:
\[ p(X)=W_{p}^{T} h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right) \]
where \(g\) aggregates each object's features across the \(T\) frames into a tracklet feature and \(h\) is a function that combines and aggregates the information of all tracklets.
In this study, two different approaches are proposed to combine the tracklets (see the sketch after this list):
- Design \(h\) as a simple averaging function to demonstrate the effectiveness of the spatial-temporal interaction reasoning.
- Use a non-local block as \(h\). The non-local block encodes the pairwise relationships between every two trajectory features before averaging them.
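A minimal sketch of the temporal interaction module with the simple averaging choice of \(h\); here \(g\) is assumed to be an MLP over an object's features concatenated across the \(T\) frames, and the class count is a placeholder (the non-local variant of \(h\) is not shown).

```python
import torch
import torch.nn as nn

class TemporalInteraction(nn.Module):
    def __init__(self, d=256, T=8, num_classes=174):
        super().__init__()
        # g: aggregate one object's trajectory (T*d) into a single d-dim feature
        self.g = nn.Sequential(nn.Linear(T * d, d), nn.ReLU())
        # W_p: classifier over the aggregated video-level feature
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, f):
        # f: (T, N, d) per-frame, per-object features from the spatial module
        T, N, d = f.shape
        tracklets = self.g(f.permute(1, 0, 2).reshape(N, T * d))  # (N, d)
        video_feat = tracklets.mean(dim=0)                        # h = averaging
        return self.classifier(video_feat)                        # p(X)
```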
Combining Video Appearance Representation
A 3D convolutional backbone transforms the video into a d-dimensional feature. This appearance representation is especially helpful when the action has no prominent inter-object dynamics.
The video appearance representation is concatenated with the object representation \(h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right)\) before being fed into the classifier.
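A minimal sketch of this late fusion step, assuming a pooled d-dimensional video feature from a 3D-conv backbone (e.g. I3D); the dimensions and class count are placeholders.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, d_obj=256, d_vid=512, num_classes=174):
        super().__init__()
        # classifier over the concatenated object + appearance feature
        self.classifier = nn.Linear(d_obj + d_vid, num_classes)

    def forward(self, obj_feat, vid_feat):
        # obj_feat: (d_obj,) output of h({g(...)}) from the temporal module
        # vid_feat: (d_vid,) pooled feature from the 3D convolutional backbone
        return self.classifier(torch.cat([obj_feat, vid_feat], dim=-1))
```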