Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that finetuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify that image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively finetuned captioner generates descriptions that resemble human references more closely than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
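For concreteness, the sketch below illustrates the kind of discriminative retrieval objective described above: the generated caption is scored against the target image and a set of distractor images by a CLIP-style retriever, and a cross-entropy loss rewards captions that make the target the highest-scoring candidate. The function and variable names are illustrative assumptions, and this is only a minimal sketch of the objective, not the actual training implementation used in the paper.

```python
# Minimal sketch of a discriminative retrieval objective (illustrative, not the paper's code).
import torch
import torch.nn.functional as F


def discriminative_loss(caption_embeddings, image_embeddings, target_idx):
    """Cross-entropy over retriever similarities.

    caption_embeddings: (B, D) text embeddings of the generated captions.
    image_embeddings:   (B, N, D) embeddings of N candidate images per caption
                        (the target plus N-1 distractors).
    target_idx:         (B,) index of the target image within each candidate set.
    """
    caption_embeddings = F.normalize(caption_embeddings, dim=-1)
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    # Similarity of each caption to each of its candidate images: (B, N).
    logits = torch.einsum("bd,bnd->bn", caption_embeddings, image_embeddings)
    # The caption is "good" if the retriever ranks the target image first.
    return F.cross_entropy(logits, target_idx)


if __name__ == "__main__":
    # Toy usage with random embeddings: batch of 4 targets, 10 candidates each.
    B, N, D = 4, 10, 512
    captions = torch.randn(B, D)
    candidates = torch.randn(B, N, D)
    targets = torch.randint(0, N, (B,))
    print(discriminative_loss(captions, candidates, targets).item())
```

In practice the caption embeddings would come from encoding the captioner's sampled outputs with the retriever's text encoder; since sampling discrete tokens is not directly differentiable, the actual finetuning procedure may propagate this signal differently than a plain differentiable loss.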