We compare the zero-shot performance of a neural caption-based image retriever when given as input either human-produced captions or captions generated by a neural captioner. We conduct this comparison on the recently introduced IMAGECODE dataset (Krojer et al., 2022), which contains hard distractors nearly identical to the images to be retrieved. We find that the neural retriever performs much better when fed neural rather than human captions, despite the fact that the former, unlike the latter, were generated without awareness of the distractors that make the task hard. Even more remarkably, when the same neural captions are given to human subjects, their retrieval performance is almost at chance level. Our results thus add to the growing body of evidence that, even when the “language” of neural models resembles English, this superficial resemblance might be deeply misleading.