Paper Dash - 2022.02.07


  • N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv:1908.10084 [cs], Aug. 2019, Accessed: Feb. 07, 2022. [Online]. Available:

When used out of the box for semantic textual similarity tasks, BERT requires that both sentences be fed into the network together, which is computationally inefficient.

On the other hand, when using BERT out of the box to map sentences to a vector space, the resulting embeddings are rather unsuitable for common similarity measures like cosine similarity.

To overcome these shortcomings, this paper proposes Sentence-BERT, which adds a pooling operation on top of BERT and fine-tunes the model in a siamese / triplet network architecture. It demonstrates SOTA performance.
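The core recipe (pool the token embeddings of each sentence independently, then compare with cosine similarity) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the random arrays below stand in for one BERT forward pass per sentence, and mean pooling is just one of the pooling modes the paper evaluates.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token embeddings, ignoring padded positions."""
    mask = attention_mask[:, None].astype(float)        # (seq_len, 1)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "token embeddings" standing in for BERT outputs (one pass per sentence).
rng = np.random.default_rng(0)
tokens_a = rng.normal(size=(5, 8))   # 5 tokens, hidden dim 8
tokens_b = rng.normal(size=(7, 8))
mask_a = np.array([1, 1, 1, 1, 0])   # last position is padding
mask_b = np.ones(7, dtype=int)

emb_a = mean_pool(tokens_a, mask_a)
emb_b = mean_pool(tokens_b, mask_b)
score = cosine_similarity(emb_a, emb_b)
```

Because each sentence is embedded independently, embeddings can be precomputed and pairwise scoring reduces to cheap vector operations instead of one full BERT pass per pair.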

Embodied Semantic Scene Graph Generation

  • X. Li, D. Guo, H. Liu, and F. Sun, “Embodied Semantic Scene Graph Generation,” in Proceedings of the 5th Conference on Robot Learning, Jan. 2022, pp. 1585–1594. Accessed: Feb. 07, 2022. [Online]. Available:

The goal of this work is to enable the agent to automatically generate a sequence of actions to explore the environment and build the corresponding semantic scene graph incrementally.


Scene Graph Generation

  • Local

Follows a graph-convolution-network-based scene graph generation method.

    The generated local semantic scene graph includes the current in-sight objects and their relations with each other.

    They additionally introduce object class embedding and bounding box coordinates as the input for each object node, and the training objective is to restore the bounding box coordinates and labels for each object and edge.

  • Global

    At each time step, the agent takes an action to move, and the equipped camera captures the RGB and depth frames of the scene to construct a local scene graph, which is further merged into the global scene graph. During this period, the detected objects in the local scene graph and the existing objects in the previous global scene graph are aligned by matching point clouds.
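The local-to-global merge hinges on deciding which detected objects are already in the global graph. The paper aligns objects by matching point clouds; the sketch below substitutes a much simpler stand-in (nearest centroid within a distance threshold) just to show the shape of the merge step. The function name and `max_dist` threshold are illustrative assumptions, not the paper's API.

```python
import numpy as np

def align_objects(local_centroids, global_centroids, max_dist=0.5):
    """Match each locally detected object to the nearest existing global
    object; objects farther than max_dist count as newly discovered.
    Returns a list of (local_idx, global_idx or None) pairs.
    Simplified stand-in for the paper's point-cloud matching."""
    matches = []
    for i, c in enumerate(local_centroids):
        if len(global_centroids) == 0:
            matches.append((i, None))
            continue
        dists = np.linalg.norm(global_centroids - c, axis=1)
        j = int(np.argmin(dists))
        matches.append((i, j if dists[j] <= max_dist else None))
    return matches

global_pts = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])  # existing objects
local_pts  = np.array([[0.1, 0.0, 0.0], [5.0, 5.0, 0.0]])  # new detections
print(align_objects(local_pts, global_pts))  # → [(0, 0), (1, None)]
```

Matched objects update their existing global node; unmatched ones are inserted as new nodes, which is what makes the global graph grow incrementally.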

The goal of the proposed navigation model is to guide the agent to take actions to explore the environment and build the semantic scene graph incrementally.

Action Space: move in 8 directions + no move + rotate camera by 90 degrees

Modeled by an LSTM whose input combines the previous action, the RGB camera feature, and the local & global graph features


Hard to train directly => Imitation + RL training

  • Imitation

Automatically construct an “optimal path” and train the model on it

  • RL

Further improve performance with RL, using a reward that accounts for graph quality and path length
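The RL objective trades off graph quality against path length; one simple shaping that captures this trade-off is "reward = improvement in graph quality minus a per-step cost." This is my illustrative formulation, not the paper's exact reward; the quality metric (here an F1-style score against ground truth) and `step_cost` are assumptions.

```python
def step_reward(graph_f1_now: float, graph_f1_prev: float,
                step_cost: float = 0.01) -> float:
    """Per-step reward: improvement in scene-graph quality (e.g. an F1
    score against the ground-truth graph) minus a fixed step penalty,
    so the agent is pushed to build a good graph along a short path."""
    return (graph_f1_now - graph_f1_prev) - step_cost
```

With this shaping, steps that don't improve the graph cost the agent `step_cost`, so it learns to stop exploring once additional movement stops adding information.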


DiffTaichi: Differentiable Programming for Physical Simulation

  • Y. Hu et al., “DiffTaichi: Differentiable Programming for Physical Simulation,” arXiv:1910.00935 [physics, stat], Feb. 2020, Accessed: Feb. 07, 2022. [Online]. Available:

A general-purpose framework that frees researchers from low-level hacking to accelerate computation, much like what PyTorch did for ML.

Decouple computation from data structure.
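The decoupling idea, in plain-Python miniature: a kernel is written once against an abstract field interface and runs unchanged on a dense or a sparse backing store. This mimics the design in spirit only; Taichi's actual API (`ti.kernel`, fields, layout annotations) looks quite different, and the class names below are made up for illustration.

```python
def saxpy_kernel(y, x, a, indices):
    """y[i] += a * x[i] over the active indices. The kernel never knows
    whether the fields are dense arrays or sparse hash maps."""
    for i in indices:
        y[i] = y[i] + a * x[i]

class DenseField:
    """Dense layout: a flat list indexed directly."""
    def __init__(self, n):
        self.data = [0.0] * n
    def __getitem__(self, i): return self.data[i]
    def __setitem__(self, i, v): self.data[i] = v
    def active(self): return range(len(self.data))

class SparseField:
    """Sparse layout: only touched indices are stored."""
    def __init__(self):
        self.data = {}
    def __getitem__(self, i): return self.data.get(i, 0.0)
    def __setitem__(self, i, v): self.data[i] = v
    def active(self): return sorted(self.data)

# Same kernel, dense layout:
x_d, y_d = DenseField(4), DenseField(4)
for i in range(4):
    x_d[i] = float(i)
saxpy_kernel(y_d, x_d, 2.0, x_d.active())   # y_d.data == [0.0, 2.0, 4.0, 6.0]

# Same kernel, sparse layout:
x_s, y_s = SparseField(), SparseField()
x_s[10] = 3.0
saxpy_kernel(y_s, x_s, 2.0, x_s.active())   # y_s[10] == 6.0
```

Swapping the data structure requires no change to the kernel, which is the point: the numerical computation and the memory layout evolve independently.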