Collocations, in the sense of idiosyncratic lexical co-occurrences of two syntactically bound
words, traditionally pose a challenge to language learners and to many Natural Language Processing
(NLP) applications alike. Reliable ground-truth resources, ideally manually compiled, are
thus of high value. We present a manually compiled bilingual English–French collocation resource
with 7,480 collocations in English and 6,733 in French. Each collocation is enriched with
information that facilitates its downstream exploitation in NLP tasks such as machine translation,
word sense disambiguation, natural language generation, and relation classification. Our
proposed enrichment covers the semantic category of the collocation (its lexical function), its
vector space representation (for each individual word as well as their joint collocation embedding),
the subcategorization patterns of both its elements, their corresponding BabelNet
ids, and, finally, indices of their occurrences in large-scale reference corpora.
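
To make the enrichment layers concrete, one entry of such a resource could be modelled roughly as follows. This is only a sketch: all class and field names, the encoding of subcategorization patterns, and the layout of corpus indices are assumptions made for illustration, not the distributed format of the resource. The two sample entries are textbook collocations and are not quoted from the resource; their embeddings are left empty and their BabelNet ids unset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CollocationElement:
    """One of the two words of a collocation (hypothetical field names)."""
    lemma: str
    subcat_pattern: str                 # subcategorization pattern; encoding assumed
    babelnet_id: Optional[str] = None   # BabelNet id; None when not filled in
    embedding: List[float] = field(default_factory=list)


@dataclass
class CollocationEntry:
    """A collocation with the enrichment layers listed above (assumed layout)."""
    language: str                       # e.g. "en" or "fr"
    base: CollocationElement
    collocate: CollocationElement
    lexical_function: str               # semantic category, e.g. "Magn"
    joint_embedding: List[float] = field(default_factory=list)
    # corpus name -> (sentence index, token index) pairs; layout assumed
    corpus_indices: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)


def group_by_lexical_function(
    entries: List[CollocationEntry],
) -> Dict[str, List[CollocationEntry]]:
    """Group entries by their semantic category (lexical function)."""
    groups: Dict[str, List[CollocationEntry]] = {}
    for entry in entries:
        groups.setdefault(entry.lexical_function, []).append(entry)
    return groups


# Illustrative entries only; not taken from the resource itself.
sample_entries = [
    CollocationEntry(
        language="en",
        base=CollocationElement(lemma="rain", subcat_pattern="N"),
        collocate=CollocationElement(lemma="heavy", subcat_pattern="ADJ"),
        lexical_function="Magn",
    ),
    CollocationEntry(
        language="en",
        base=CollocationElement(lemma="decision", subcat_pattern="N"),
        collocate=CollocationElement(lemma="make", subcat_pattern="V"),
        lexical_function="Oper1",
    ),
]

grouped = group_by_lexical_function(sample_entries)
```

Grouping by lexical function is the kind of lookup a relation classification or generation component would typically need; the other layers (embeddings, BabelNet ids, corpus indices) would be read from the same record.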