Something-Else: Compositional Action Recognition
J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, "Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks," arXiv:1912.09930 [cs], Sep. 2020, Accessed: Jan. 26, 2022. [Online]. Available: http://arxiv.org/abs/1912.09930
Abstract
Do current ML algorithms have the capability to generalize across different combinations of verbs and nouns (compositionality)?
It turns out that current methods such as I3D, built on scene-level convolutional operators, rely heavily on spatial appearance rather than temporal transformations or geometric relations, and thus fail at this task.
This paper introduces a model built on a sparse, semantically rich object graph learned for each action, modeling explicit relations between subjects and objects.
It also introduces the Something-Else dataset and a compositional action recognition task, in which training and testing data are split so as to test the model's ability to generalize compositionally.
Note
Improving the Token Mixer
The spatial interaction module looks just like a token mixer. Could replacing it with self-attention improve performance? (A rough sketch of this idea follows.)
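A minimal sketch of this idea, assuming PyTorch and per-frame object features of shape (T, N, d) as used by the spatial interaction module below; this is my speculation, not part of the paper, and the module name and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttentionMixer(nn.Module):
    """Replace the mean-pooling token mixer with self-attention over the N object tokens."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        # x: (T, N, d) per-frame object tokens; frames act as the batch dimension
        mixed, _ = self.attn(x, x, x)
        return torch.relu(mixed)  # (T, N, d), drop-in for the mean-based spatial mixer

mixer = SelfAttentionMixer(d=256, heads=4)
out = mixer(torch.rand(8, 4, 256))  # T=8 frames, N=4 objects, d=256
```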
Approach
Spatial-Temporal Interaction Networks
STIN = Object Detector + Tracker + Reasoning
By modeling the transformation of object geometric relations in a video, STIN can generalize well to unseen compositions.
Object-centric Representation
A video with T frames => object detection: hands and generic candidate constituent objects.
Two types of feature representation are extracted for each bounding box (see the sketch after this list):
- Bounding box coordinates:
  - center coordinates + width + height => MLP => d-dimensional feature
- Object identity: another d-dimensional embedding representing the identities of subjects and objects
  - Subject embedding (agent): represents the hands in an action
  - Object embedding: represents the objects involved in the action
  - Null embedding: represents dummy boxes irrelevant to the action
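A minimal sketch of this object-centric encoding, assuming PyTorch; the module name, feature dimension, and the choice to sum coordinate features with identity embeddings (rather than concatenate them) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        # MLP over normalized (cx, cy, w, h) box coordinates -> d-dimensional feature
        self.coord_mlp = nn.Sequential(
            nn.Linear(4, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
        )
        # identity embeddings: 0 = subject (hand), 1 = object, 2 = null (dummy box)
        self.identity_emb = nn.Embedding(3, d)

    def forward(self, boxes, identities):
        # boxes: (T, N, 4); identities: (T, N) integer ids in {0, 1, 2}
        # summing the two features is an assumption; concatenation would also work
        return self.coord_mlp(boxes) + self.identity_emb(identities)

enc = ObjectEncoder(d=256)
x = enc(torch.rand(8, 4, 4), torch.randint(0, 3, (8, 4)))  # (T=8, N=4, d=256)
```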
Spatial-temporal Interaction Reasoning
Given \(T\) video frames and \(N\) objects per frame, we denote the set of object features as \(X=\) \(\left(x_{1}^{1}, \ldots, x_{N}^{1}, x_{1}^{2}, \ldots, x_{N}^{2}, \ldots, x_{N}^{T}\right)\), where \(x_{i}^{t}\) represents the feature of object \(i\) in frame \(t\).

Spatial Interaction Module
Performs spatial interaction reasoning among the N objects in each frame.
\[ f\left(x_{i}^{t}\right)=\operatorname{ReLU}\left(W_{f}^{T}\left[x_{i}^{t}, \frac{1}{N-1} \sum_{j \neq i} x_{j}^{t}\right]\right) \]
where \([\cdot,\cdot]\) denotes concatenation of two features along the channel dimension and \(W_{f}\) is a learnable weight matrix implemented as a fully connected layer.
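A minimal sketch of this module, assuming PyTorch and per-object features of shape (T, N, d); only the formula above is from the paper, the rest is illustrative.

```python
import torch
import torch.nn as nn

class SpatialInteraction(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        # W_f: fully connected layer over [x_i, mean of the other N-1 objects]
        self.fc = nn.Linear(2 * d, d)

    def forward(self, x):
        # x: (T, N, d) per-object features in each frame
        N = x.shape[1]
        # (1 / (N-1)) * sum_{j != i} x_j, computed as (sum over all - x_i) / (N - 1)
        others = (x.sum(dim=1, keepdim=True) - x) / (N - 1)
        return torch.relu(self.fc(torch.cat([x, others], dim=-1)))  # (T, N, d)

spatial = SpatialInteraction(d=256)
f = spatial(torch.rand(8, 4, 256))  # frame-wise interaction features, (T=8, N=4, d=256)
```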

Temporal Interaction Module
Given the aggregated features of the objects in each frame, temporal reasoning is performed on top of them.
\[ p(X)=W_{p}^{T} h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right) \]
where \(g\) aggregates the per-frame features of object \(i\) into a tracklet feature and \(h\) is a function combining and aggregating the information across tracklets.
In this study, two different approaches are proposed to combine tracklets:
- Design \(h\) as a simple averaging function, to demonstrate the effectiveness of the spatial-temporal interaction reasoning on its own.
- Utilize a non-local block as the function \(h\). The non-local block encodes the pairwise relationships between every two trajectory features before averaging them.
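A minimal sketch of the temporal interaction module with \(h\) as simple averaging, assuming PyTorch; the exact form of \(g\) (here an MLP over the concatenated per-frame features of each tracklet) and the class count are assumptions.

```python
import torch
import torch.nn as nn

class TemporalInteraction(nn.Module):
    def __init__(self, d=256, T=8, num_classes=174):
        super().__init__()
        # g: aggregate the T per-frame features of one object (a tracklet) into one vector
        self.g = nn.Sequential(nn.Linear(T * d, d), nn.ReLU())
        # W_p: linear classifier on top of the combined tracklet representation
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, f):
        # f: (T, N, d) spatially mixed object features
        T, N, d = f.shape
        tracklets = self.g(f.permute(1, 0, 2).reshape(N, T * d))  # (N, d) tracklet features
        video_repr = tracklets.mean(dim=0)                        # h = averaging over tracklets
        return self.classifier(video_repr)                        # logits p(X)

temporal = TemporalInteraction(d=256, T=8, num_classes=174)
logits = temporal(torch.rand(8, 4, 256))  # (174,) action logits
```

The non-local variant would replace the plain mean with pairwise attention between tracklet features before averaging them.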

Combining Video Appearance Representation
A 3D conv backbone transforms the video into a d-dimensional feature. This appearance representation is especially helpful when the action has no prominent inter-object dynamics.
The video appearance representation is concatenated with the object representation \(h\left(\left\{g\left(x_{i}^{1}, \ldots, x_{i}^{T}\right)\right\}_{i=1}^{N}\right)\) before being fed into the classifier.
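A minimal sketch of this fusion step, assuming PyTorch; the global-pooled backbone feature, dimensions, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, d=256, num_classes=174):
        super().__init__()
        # classifier over [appearance feature, aggregated object representation]
        self.classifier = nn.Linear(2 * d, num_classes)

    def forward(self, appearance_feat, object_repr):
        # appearance_feat: (d,) global-pooled feature from a 3D conv backbone
        # object_repr:     (d,) h({g(x_i^1, ..., x_i^T)}) from the temporal module
        return self.classifier(torch.cat([appearance_feat, object_repr], dim=-1))

head = FusionHead(d=256, num_classes=174)
logits = head(torch.rand(256), torch.rand(256))  # (174,) fused action logits
```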