- N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv:1908.10084 [cs], Aug. 2019, Accessed: Feb. 07, 2022. [Online]. Available: http://arxiv.org/abs/1908.10084
For semantic textual similarity tasks out of the box, BERT requires that both sentences be fed into the network together, which is computationally inefficient.
On the other hand, when BERT is used out of the box to map sentences to a vector space, the resulting embeddings are rather unsuitable for common similarity measures such as cosine similarity.
To overcome these shortcomings, the paper proposes Sentence-BERT, which adds a pooling operation on top of BERT and is fine-tuned in a siamese / triplet network architecture. It demonstrates SOTA performance.
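As a rough illustration of the pooling step, here is a minimal sketch (plain NumPy with toy token embeddings, not the actual SBERT implementation) of mean pooling followed by cosine similarity:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = mask.sum(axis=1)                         # (batch, 1)
    return summed / counts

def cosine_sim(a, b):
    """Cosine similarity between two fixed-size sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "token embeddings" for two sentences: batch=2, seq=3, dim=4
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 3, 4))
mask = np.array([[1, 1, 0], [1, 1, 1]])  # first sentence has one padding token

sent_emb = mean_pool(tokens, mask)
print(cosine_sim(sent_emb[0], sent_emb[1]))
```

Because each sentence is encoded independently into a fixed-size vector, similarity search over a large corpus reduces to cheap vector comparisons instead of one full forward pass per sentence pair.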
Embodied Semantic Scene Graph Generation
- X. Li, D. Guo, H. Liu, and F. Sun, “Embodied Semantic Scene Graph Generation,” in Proceedings of the 5th Conference on Robot Learning, Jan. 2022, pp. 1585–1594. Accessed: Feb. 07, 2022. [Online]. Available: https://proceedings.mlr.press/v164/li22e.html
The goal of this work is to enable the agent to automatically generate a sequence of actions to explore the environment and build the corresponding semantic scene graph incrementally.
Scene Graph Generation
The method follows graph convolutional network (GCN) based scene graph generation.
The generated local semantic scene graph includes the current in-sight objects and their relations with each other.
They additionally introduce object class embedding and bounding box coordinates as the input for each object node, and the training objective is to restore the bounding box coordinates and labels for each object and edge.
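A minimal sketch of what such a node input might look like: a class embedding concatenated with bounding-box coordinates. All names and dimensions here are illustrative assumptions, not the paper's:

```python
import numpy as np

def node_features(class_id, bbox, class_embeddings):
    """Node input = class embedding concatenated with normalized
    bounding-box coordinates (x1, y1, x2, y2)."""
    return np.concatenate([class_embeddings[class_id],
                           np.asarray(bbox, dtype=float)])

# 5 object classes, 8-dimensional embeddings (toy sizes)
emb_table = np.random.default_rng(0).normal(size=(5, 8))
feat = node_features(2, (0.1, 0.2, 0.5, 0.6), emb_table)
print(feat.shape)  # embedding dim + 4 bbox coords
```

The training objective of reconstructing the bounding boxes and labels from these node features then acts as a self-supervisory signal on the graph network.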
At each time step, the agent takes an action to move, and the equipped camera captures the RGB and depth frames of the scene to construct a local scene graph, which is further merged into the global scene graph. During this period, the detected objects in the local scene graph and the existing objects in the previous global scene graph are aligned by matching point clouds.
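The local-to-global alignment could be sketched roughly as follows. The paper matches point clouds; this toy version instead matches object centroids by label and distance, and all names and thresholds are illustrative assumptions:

```python
import math

def merge_local_into_global(global_objs, local_objs, dist_thresh=0.5):
    """Align detected objects in the local graph with existing objects in the
    global graph by nearest 3-D centroid (same label, within a threshold);
    unmatched local objects become new global nodes.
    Objects are dicts: id -> (label, centroid)."""
    merged = dict(global_objs)
    next_id = max(merged, default=-1) + 1
    mapping = {}  # local id -> global id
    for lid, (label, c) in local_objs.items():
        best, best_d = None, dist_thresh
        for gid, (glabel, gc) in merged.items():
            d = math.dist(c, gc)
            if glabel == label and d < best_d:
                best, best_d = gid, d
        if best is None:
            merged[next_id] = (label, c)
            mapping[lid] = next_id
            next_id += 1
        else:
            mapping[lid] = best
    return merged, mapping
```

For example, a newly detected chair 0.1 m from a known chair would be fused with it, while a newly detected table would be added as a fresh node.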
The goal of the proposed navigation model is to guide the agent to take actions to explore the environment and build the semantic scene graph incrementally.
Action space: move in 8 directions + no move + rotate the camera by 90 degrees
The policy is modeled by an LSTM whose inputs are the previous action, the RGB camera feature, and the local and global graph features
Hard to train directly => imitation learning + RL training
An "optimal path" is designed automatically and the model is trained on it by imitation
Performance is further improved by RL with an objective that accounts for graph quality and path length
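The RL objective trading off graph quality against path length might be sketched as a reward like the following. This is a hypothetical stand-in; the paper's exact formulation differs:

```python
def rl_reward(pred_edges, gt_edges, path_length, lam=0.01):
    """Hypothetical reward: F1 of predicted vs. ground-truth scene-graph
    edges, minus a penalty proportional to the length of the exploration
    path. Edges are hashable triples like (subject, relation, object)."""
    pred, gt = set(pred_edges), set(gt_edges)
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1 - lam * path_length
```

A perfect graph built over a short path scores highest; a longer path must improve the graph enough to pay for the extra steps.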
- Y. Hu et al., “DiffTaichi: Differentiable Programming for Physical Simulation,” arXiv:1910.00935 [physics, stat], Feb. 2020, Accessed: Feb. 07, 2022. [Online]. Available: http://arxiv.org/abs/1910.00935
A general-purpose framework that frees researchers from low-level hacks for accelerating computation, much as PyTorch did for ML.
It decouples computation from data structures.
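To illustrate the core idea of differentiable simulation (this is not DiffTaichi, which obtains gradients via automatic differentiation of Taichi kernels), here is a plain-Python sketch that differentiates through a time-stepped spring simulation using finite differences as a stand-in for autodiff:

```python
def simulate(x0, v0, k=10.0, dt=0.01, n=100):
    """Roll out n explicit-Euler steps of a 1-D harmonic oscillator
    and return the final position."""
    x, v = x0, v0
    for _ in range(n):
        v -= k * x * dt
        x += v * dt
    return x

def grad_final_x_wrt_x0(x0, v0, eps=1e-5):
    """Central finite-difference gradient of the final position w.r.t. the
    initial position -- the kind of quantity a differentiable simulator
    exposes directly, enabling gradient-based controller optimization."""
    return (simulate(x0 + eps, v0) - simulate(x0 - eps, v0)) / (2 * eps)

print(grad_final_x_wrt_x0(1.0, 0.0))
```

In a differentiable simulator this gradient flows through all time steps automatically, so initial conditions or controller parameters can be optimized with standard gradient descent.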