Humans can acquire detailed knowledge of visual word meanings without first-person sensory access
- M. Bedny, J. Koster-Hale, G. Elli, L. Yazzolino, and R. Saxe, “There’s more to ‘sparkle’ than meets the eye: Knowledge of vision and light verbs among congenitally blind and sighted individuals,” Cognition, vol. 189, pp. 105–115, Aug. 2019, doi: 10.1016/j.cognition.2019.03.017.
How rich is blind individuals' knowledge about vision, and how similar is it to the knowledge of sighted people?
A cognitive science experiment in which congenitally blind and sighted participants judged the semantic similarity of verbs in three categories: visual perception, tactile perception, and amodal knowledge acquisition.
Relative to the sighted, blind speakers had higher agreement among themselves on touch perception and sound emission verbs. However, for visual verbs, the judgments of blind and sighted participants were indistinguishable.
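To make "agreement among themselves" concrete, here is a minimal sketch of one common way to quantify within-group agreement: the mean pairwise correlation between participants' similarity ratings. This is an assumption for illustration, not necessarily the measure used in the paper. The function names and the toy ratings are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_agreement(ratings):
    """Average correlation over all pairs of participants in one group."""
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy data, invented for illustration: each row holds one participant's
# similarity ratings for the same ordered list of verb pairs.
group_a = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 3, 4, 5]]
group_b = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]

# A higher value means participants in the group rate verb pairs more alike.
print(mean_pairwise_agreement(group_a) > mean_pairwise_agreement(group_b))
```

Under this measure, two groups can be compared per verb category by computing the score separately for each category's verb pairs.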
Image–Language Transformers are NOT good verb learners
- L. A. Hendricks and A. Nematzadeh, “Probing Image-Language Transformers for Verb Understanding,” arXiv:2106.09141 [cs], Jun. 2021, Accessed: Feb. 05, 2022. [Online]. Available: http://arxiv.org/abs/2106.09141
The paper introduces SVO-Probes, a verb-focused benchmark for examining subject, verb, object triplets. It is a set of English image–sentence pairs, each annotated with whether the sentence corresponds to the image. The dataset covers 421 verbs and includes over 48,000 image–sentence pairs.
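The benchmark structure described above can be sketched as a small record type plus an accuracy check. This is a hypothetical layout: the field names, the example rows, and the `always_match` baseline are invented for illustration and do not reflect the actual SVO-Probes file format or any model from the paper.

```python
from dataclasses import dataclass


@dataclass
class ProbePair:
    """One image-sentence pair with its subject, verb, object triplet."""
    image_id: str
    sentence: str
    subject: str
    verb: str
    obj: str
    is_match: bool  # True if the sentence corresponds to the image


def accuracy(pairs, predict_match):
    """Fraction of pairs where the predicted match label equals the annotation."""
    correct = sum(predict_match(p.image_id, p.sentence) == p.is_match for p in pairs)
    return correct / len(pairs)


# Invented examples: the same sentence paired with a matching image and
# with a non-matching image.
pairs = [
    ProbePair("img_001", "A girl sits on the grass.", "girl", "sit", "grass", True),
    ProbePair("img_002", "A girl sits on the grass.", "girl", "sit", "grass", False),
]


def always_match(image_id, sentence):
    """Trivial baseline that claims every sentence matches its image."""
    return True


print(accuracy(pairs, always_match))  # 0.5
```

Because every sentence here appears with both a matching and a non-matching image, a model that ignores the image cannot beat chance on this toy set, which is why this kind of pairing is useful for probing.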
Despite good performance on downstream tasks, image–language transformers fail on the SVO-Probes task, which requires multimodal understanding, because they cannot distinguish finer-grained differences between images.
The results highlight that there is still considerable progress to be made when training multimodal representations, and that verbs in particular are an interesting challenge in image–language representation learning.