Semantics-Driven Recognition of Collocations Using Word Embeddings

L2 learners often produce “ungrammatical” word combinations such as *give a suggestion or *make a walk. This is because of the “collocationality” of one of their items (the base) that limits the acceptance of collocates to express a specific meaning (‘perform’ above). We propose an algorithm that delivers, for a given base and the intended meaning of a collocate, the actual collocate lexeme(s) (make / take above). The algorithm exploits the linear mapping between bases and collocates from examples and generates a collocation transformation matrix which is then applied to novel unseen cases. The evaluation shows a promising line of research in collocation discovery.


Introduction
Collocations of the kind make [a] suggestion, attend [a] lecture, heavy rain, deep thought, strong tea, etc., are restricted lexical co-occurrences of two syntactically bound lexical elements (Kilgarriff, 2006). The central role of collocations for second language (henceforth, L2) learning has been discussed in a series of theoretical and empirical studies (Hausmann, 1984; Bahns and Eldaw, 1993; Granger, 1998; Lewis and Conzett, 2000; Nesselhauf, 2005; Alonso Ramos et al., 2010) and is widely reflected in (especially English) learner dictionaries. In computational lexicography, several statistical measures have been used to retrieve collocations from corpora, among them mutual information (Church and Hanks, 1989; Lin, 1999), entropy (Kilgarriff, 2006), pointwise mutual information (Bouma, 2010), and weighted pointwise mutual information (Carlini et al., 2014). However, the needs of language learners go beyond mere lists of collocations: the cited studies reveal that language learners often build "miscollocations" (e.g., *give a suggestion or *have the curiosity) to express the intended meaning. In other words, they fail to observe, in Kilgarriff's terms, the "collocationality" restrictions of L2, which imply that in language production, one of the elements of a collocation (the base) is freely chosen, while the choice of the other (the collocate) depends on the base (Hausmann, 1989; Cowie, 1994). For instance, to express the meaning of 'do' or 'perform', the base suggestion prompts for the choice of make as collocate: make [a] suggestion, while advice prompts for give: give [an] advice; to express the meaning of 'participate in', lecture prompts for attend: attend [a] lecture, while operation prompts for assist: assist [an] operation; to express the meaning of 'intense' in connection with rain, the right collocate is heavy, while 'intense wind' is strong wind. And so on. The idiosyncrasy of collocations also makes them language-specific.
Thus, in English, you take [a] walk, in Spanish you 'give' it (dar [un] paseo), and in German and French you 'make' it ([einen] Spaziergang machen, faire [une] promenade); in English, rain is heavy, while in Spanish and German it is 'strong' (fuerte lluvia/starker Regen).
In order to effectively support L2 learners, techniques are thus needed that are able not only to retrieve collocations, but also to provide, for a given base (or headword) and a given semantic gloss of a collocate meaning, the actual collocate lexeme.
In what follows, we present such a technique, which is grounded in Mikolov et al. (2013c)'s word embeddings, and which leverages the fact that semantically related words in two different vector representations are related by a linear transformation (Mikolov et al., 2013b). This property has been exploited for word-based translation (Mikolov et al., 2013b), learning semantic hierarchies (hyponym-hypernym relations) in Chinese (Fu et al., 2014), and modeling linguistic similarities between standard (Wikipedia) and nonstandard language (Twitter) (Tan et al., 2015). In our task, we learn a transition matrix over a small number of collocation examples whose collocates share the same semantic gloss, and then apply this matrix to discover new collocates for any previously unseen collocation base. We discuss the outcome of the experiments with ten different collocate glosses (including 'do' / 'perform', 'increase', 'decrease', etc.), and show that for most glosses, an approach that combines the application of a gloss-specific transition matrix with a pruning stage based on statistical evidence outperforms approaches that exploit only one of these stages. It also outperforms a baseline based on collocation retrieval that exploits the embeddings property for drawing analogies, such as x ∼ applause ≡ heavy ∼ rain (implying x=thunderous) (Rodríguez-Fernández et al., 2016).

Theoretical model
The semantic glosses of collocates across collocations can be generalized into a generic semantic typology modeled, e.g., by Mel'čuk (1996)'s Lexical Functions. For instance, absolute, deep, strong, heavy in absolute certainty, deep thought, strong wind, and heavy storm can all be glossed as 'intense'; make, take, give, carry out in make [a] proposal, take [a] step, give [a] hint, carry out [an] operation can be glossed as 'do'/'perform'; etc. Our goal is to capture the relation that holds between the training bases and the collocates with the same gloss, such that given a new base and a gloss, we can retrieve its corresponding collocate(s) with this gloss. Thus, given absolute certainty, deep thought, and strong wind as training examples, storm as input base and 'intense' as gloss, we aim at retrieving the collocate heavy.
As already mentioned above, our approach is based on Mikolov et al. (2013b)'s linear transformation model, which associates word vector representations between two analogous spaces. In Mikolov et al.'s original work, one space captures words in language $L_1$ and the other space words in language $L_2$, such that the found relations are between translation equivalents. In our case, we define a base space $B$ and a collocate space $C$ in order to relate bases with their collocates that have the same meaning, in the same language. To obtain the word vector representations in $B$ and $C$, we use Mikolov et al. (2013c)'s word2vec.
The linear transformation model is constructed as follows. Let $T$ be a set of collocations whose collocates share the semantic gloss $\tau$, and let $b_{t_i}$ and $c_{t_i}$ be the base and the collocate, respectively, of the collocation $t_i \in T$. The base matrix $B_\tau = [b_{t_1}, b_{t_2}, \ldots, b_{t_n}]$ and the collocate matrix $C_\tau = [c_{t_1}, c_{t_2}, \ldots, c_{t_n}]$ are given by their corresponding vector representations. Together, they constitute a set of training examples $\Phi_\tau$, composed of vector pairs $\{b_{t_i}, c_{t_i}\}_{i=1}^{n}$. $\Phi_\tau$ is used to learn a linear transformation matrix $\Psi_\tau \in \mathbb{R}^{B \times C}$.
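As a minimal sketch of how such a matrix could be learned from base-collocate vector pairs, the snippet below fits it by ordinary least squares with NumPy and projects a new base vector into the collocate space. The function names, the closed-form solver, and the cosine-similarity ranking are illustrative assumptions and not necessarily the procedure used in the experiments.

```python
import numpy as np

def learn_transformation(base_vecs, collocate_vecs):
    """Fit Psi so that Psi @ b_i approximates c_i for every training pair.

    base_vecs has shape (n, d_B), collocate_vecs has shape (n, d_C).
    Returns Psi with shape (d_C, d_B). Closed-form least squares is an
    assumption made for this sketch.
    """
    # lstsq solves base_vecs @ X = collocate_vecs in the least-squares sense.
    X, *_ = np.linalg.lstsq(base_vecs, collocate_vecs, rcond=None)
    return X.T

def rank_collocates(psi, base_vec, vocab, vocab_vecs, top_k=10):
    """Project a base vector and rank vocabulary words by cosine similarity.

    Ranking by cosine similarity is an assumption of this sketch.
    """
    projected = psi @ base_vec
    norms = np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(projected)
    sims = (vocab_vecs @ projected) / (norms + 1e-12)
    order = np.argsort(-sims)[:top_k]
    return [(vocab[i], float(sims[i])) for i in order]

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(3, 4))          # hypothetical ground-truth mapping
bases = rng.normal(size=(20, 4))            # 20 training bases, 4 dimensions
collocates = bases @ true_map.T             # their collocates, 3 dimensions
psi = learn_transformation(bases, collocates)
print(np.allclose(psi, true_map))
```

With real embeddings, `bases` and `collocates` would hold the vectors of the training collocations for one gloss, and `vocab_vecs` the collocate-space vocabulary.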
Following the notation in (Tan et al., 2015), this transformation can be depicted as:
$$\Psi_\tau B_\tau = C_\tau$$
We follow Mikolov et al.'s original approach and compute $\Psi_\tau$ as follows:
$$\min_{\Psi_\tau} \sum_{i=1}^{|\Phi_\tau|} \lVert \Psi_\tau b_{t_i} - c_{t_i} \rVert^2$$
Hence, for any given novel base $b_{j\tau}$, we obtain a novel list of ranked collocates by applying $\Psi_\tau b_{j\tau}$ and filtering the resulting candidates by part of speech and NPMI, an association measure that is based on the pointwise mutual information, but takes into account the asymmetry of the lexical dependencies between a base and its collocate (Carlini et al., 2014):
$$NPMI_C = \frac{PMI(\text{collocate}, \text{base})}{-\log p(\text{collocate})}$$

Experiments

Setup of the Experiments
We carried out experiments with 10 of the most frequent semantic collocate glosses (listed in the first column of Table 1). As is common in previous work on semantic collocation classification (Moreno et al., 2013), our training set consists of a list of manually annotated correct collocations. For this purpose, we randomly selected nouns from the Macmillan Dictionary and manually classified their corresponding collocates with respect to the glosses.3 Note that there may be more than one collocate for each base. Since collocations with different collocate meanings are not evenly distributed in language (e.g., speakers use collocations conveying the idea of 'intense' and 'perform' more often than those conveying 'stop performing'), the number of instances per gloss in our training data also varies significantly (see Table 1).
Table 1: Semantic glosses and size of training set
Due to the asymmetric nature of collocations, not all corpora may be equally suitable for the derivation of word embedding representations for both bases and collocates. Thus, we may hypothesize that for modeling (nominal) bases, which keep their literal meaning in collocations, a standard register corpus with a small percentage of figurative meanings will be more adequate, while for modeling collocates, a corpus which is potentially rich in collocations is likely to be more appropriate. In order to verify this hypothesis, we carried out two different experiments. In the first experiment, we used for both bases and collocates vectors pre-trained on the Google News corpus (GoogleVecs), which are available at word2vec's website. In the second experiment, the bases were modeled by training their word vectors over a 2014 dump of the English Wikipedia, while for modeling collocates, GoogleVecs was used again. In other words, we assumed that Wikipedia is a standard register corpus and thus better for modeling B, while GoogleVecs is more suitable for modeling C.
The figures in Section 3.2 below will give us a hint whether this assumption is correct.
3 At this stage of our work, we considered only collocations that involve single word tokens for both the base and the collocate. In other words, we did not take into account, e.g., phrasal verb collocates such as stand up, give up or calm down. We also left aside the problem of subcategorization in collocations; cf., e.g., into in take [into] consideration.
For the calculation of NPMI during postprocessing, the British National Corpus (BNC) was used.
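The NPMI-based postprocessing can be illustrated with a small sketch. It assumes that the asymmetric variant normalizes the pointwise mutual information of a base-collocate pair by the negative log probability of the collocate alone (our reading of Carlini et al., 2014), that probabilities are estimated from raw corpus counts, and that candidates scoring above a threshold are kept. The function names, the count dictionaries, and the threshold value are hypothetical.

```python
import math

def npmi_c(pair_count, base_count, collocate_count, total):
    """Asymmetric NPMI: PMI(collocate, base) / -log p(collocate).

    All counts are raw corpus frequencies; total is the corpus size.
    Returns -inf when the score is undefined (assumption of this sketch).
    """
    if pair_count == 0 or base_count == 0 or collocate_count == 0:
        return float("-inf")
    if collocate_count >= total:
        return float("-inf")
    p_pair = pair_count / total
    p_base = base_count / total
    p_col = collocate_count / total
    pmi = math.log(p_pair / (p_base * p_col))
    return pmi / -math.log(p_col)

def filter_by_npmi(candidates, base, pair_counts, word_counts, total, threshold=0.0):
    """Keep candidates whose score exceeds the (hypothetical) threshold,
    preserving the order of the incoming ranked list."""
    kept = []
    for collocate in candidates:
        score = npmi_c(
            pair_counts.get((collocate, base), 0),
            word_counts.get(base, 0),
            word_counts.get(collocate, 0),
            total,
        )
        if score > threshold:
            kept.append((collocate, score))
    return kept

# Toy counts, invented for illustration only.
pair_counts = {("make", "suggestion"): 10}
word_counts = {"suggestion": 100, "make": 50, "give": 50}
print(filter_by_npmi(["give", "make"], "suggestion", pair_counts, word_counts, 10000))
```

In practice the counts would come from a parsed corpus such as the BNC, restricted to the dependency relation that links base and collocate.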

Evaluation
The outcome of each experiment was assessed by verifying the correctness of each retrieved candidate from the top-10 candidates obtained for each test base. A total of 10 bases was evaluated for each gloss. The ground truth test set was created in a similar fashion as the training set: nouns from the Macmillan Dictionary were randomly chosen, and their collocates manually classified in terms of the different glosses, until a set of ten unseen base-collocate pairs was obtained for each gloss.
For the outcome of each experiment, we computed both precision (p) as the ratio of retrieved collocates that match the targeted glosses to the overall number of obtained collocates for each base, and Mean Reciprocal Rank (MRR), which rewards the position of the first correct result in a ranked list of outcomes:
$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$
where $Q$ is a sample of experiment runs and $rank_i$ refers to the rank position of the first relevant outcome for the $i$th run. MRR is commonly used in Information Retrieval and Question Answering, but has also been shown to be well suited for collocation discovery; see, e.g., (Wu et al., 2010).
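Both measures are straightforward to compute from ranked candidate lists. The following sketch assumes that each run is given as a pair of a ranked list of retrieved collocates and a set of correct collocates, and that a run without any correct candidate contributes zero to MRR; the function names and the toy data are illustrative only.

```python
def precision_at_k(retrieved, relevant, k=10):
    """Share of the top-k retrieved collocates that are correct."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for cand in top if cand in relevant) / len(top)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first correct candidate over all runs.

    runs is a list of (retrieved, relevant) pairs; a run with no correct
    candidate contributes 0 (assumption of this sketch).
    """
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, cand in enumerate(retrieved, start=1):
            if cand in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Toy runs, invented for illustration.
runs = [
    (["do", "make", "take"], {"make"}),   # first hit at rank 2
    (["give", "offer"], {"give"}),        # first hit at rank 1
    (["heavy", "big"], {"strong"}),       # no hit
]
print(mean_reciprocal_rank(runs))  # (1/2 + 1 + 0) / 3 = 0.5
```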
We evaluated four different configurations of our technique against two baselines.
The first baseline (S1) is based on the regularities in word embeddings, with the vec(king) − vec(man) + vec(woman) = vec(queen) example as the paramount case. In this context, we manually selected one representative example for each semantic gloss to discover collocates for novel bases following the same schema; cf., e.g., for the gloss 'perform', vec(take) − vec(walk) + vec(suggestion) = vec(make) (where make is the collocate to be discovered); see (Rodríguez-Fernández et al., 2016) for details. The second baseline (S2) is an extension of S1 in that its output is filtered by POS collocation patterns and NPMI.
The four configurations of our technique that we tested were: S3, which is based on the transition matrix for which GoogleVecs is used as reference vector space representation for both bases and collocates; S4, which applies POS-pattern and NPMI filters to the output of S3; S5, which is equivalent to S3, but relies on a vector space representation derived from Wikipedia for learning base projections and on a vector space representation from GoogleVecs for collocate projections; and, finally, S6, where the S5 output is, again, filtered by POS collocation patterns and NPMI.
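The analogy schema behind the S1 baseline can be sketched with a few lines of NumPy. The three-dimensional toy vectors below are invented so that the arithmetic works out exactly. Excluding the three input words from the candidate list and ranking by cosine similarity are common practice for analogy queries but are assumptions here, as is the function name.

```python
import numpy as np

def analogy_collocates(example_collocate, example_base, new_base, embeddings, top_k=10):
    """Rank candidates for x in: x ~ new_base as example_collocate ~ example_base."""
    query = embeddings[example_collocate] - embeddings[example_base] + embeddings[new_base]
    excluded = {example_collocate, example_base, new_base}
    scored = []
    for word, vec in embeddings.items():
        if word in excluded:
            continue  # skip the query words themselves (assumption)
        sim = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((word, sim))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings; real experiments would load pre-trained vectors.
toy = {
    "walk": np.array([1.0, 0.0, 0.0]),
    "suggestion": np.array([0.0, 1.0, 0.0]),
    "take": np.array([1.0, 0.0, 1.0]),
    "make": np.array([0.0, 1.0, 1.0]),
    "heavy": np.array([0.5, 0.5, -1.0]),
}
# vec(take) - vec(walk) + vec(suggestion) equals vec(make) in this toy space.
print(analogy_collocates("take", "walk", "suggestion", toy, top_k=1)[0][0])
```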

Discussion
The results of the experiments are displayed in Table 2. In general, the configurations S3-S6 largely outperform the baselines, with the exception of the gloss 'increase', for which S2 equals S6 as far as p is concerned. However, in this case, too, MRR is considerably higher for S6, which achieves the highest MRR scores for 6 and the highest precision scores for 7 out of 10 glosses (see the S6 columns in Table 2). In other words, the full pipeline promotes good collocate candidates to the first positions of the ranked result lists and is also best in terms of accuracy.
Comparing S1, S3, and S5 to S2, S4, and S6, we may conclude that the inclusion of a filtering module (and, in particular, of an NPMI filtering module) contributes substantially to the overall precision in nearly all cases ('decrease' being the only exception). The comparison of the precision obtained for configurations S3 and S5 also reveals that for 7 glosses the strategy to model C and B on different corpora paid off. This is different as far as MRR is concerned. Further investigation is needed for the examination of this discrepancy.
We can observe that certain glosses seem to exhibit less linguistic variation, requiring a less populated transformation function from bases to collocates. Consider the case of 'show', which generates with only 49 training pairs the second best transition matrix, with p=0.70. It is also informative to contrast the performance on pairs of glosses with opposite meanings, such as 'begin to perform' vs. 'stop performing'; 'increase' vs. 'decrease'; 'intense' vs. 'weak'; and finally 'create, cause' vs. 'put an end'. Better performance is achieved consistently on the positive counterparts (e.g., 'begin to perform' over 'stop performing'). A closer look at the output reveals that in these cases positive glosses are persistently classified as negative. Further research is needed to first understand why this is the case and then to come up with an improvement of the technique, in particular on the negative glosses.
The fact that for some of the glosses precision is rather low may be taken as a hint that the proposed technique is not suitable for the task of semantics-oriented recognition of collocations. However, it should also be stressed that our evaluation was very strict: a retrieved collocate candidate was considered correct only if it formed a collocation with the base and if it belonged to the target semantic gloss. In particular the first condition might be too rigorous, given that, in some cases, there is a margin of doubt whether a combination is a free co-occurrence or a collocation; cf., e.g., huge challenge or reflect [a] concern, which were rejected as collocations in our evaluation. Since such co-occurrences may also be useful for L2 learners, we carried out a second evaluation in which all the suggested collocate candidates that belonged to a target semantic gloss were considered correct, even if they did not form a collocation. Cf. Table 4 for the outcome of this evaluation for the S6 configuration. Only for 'perform' did the precision remain the same as before. This is because collocates assigned to this gloss are support verbs (and thus void of their own lexical semantic content).
Table 4: Precision of the coarse-grained evaluation of the S6 configuration

Conclusions
As already pointed out in Section 1, a substantial amount of work has been carried out to automatically retrieve collocations from corpora (Choueka, 1988; Church and Hanks, 1989; Smadja, 1993; Lin, 1999; Kilgarriff, 2006; Evert, 2007; Pecina, 2008; Bouma, 2010; Futagi et al., 2008; Gao, 2013). Most of this work is based on statistical measures that indicate how likely the elements of a possible collocation are to co-occur, while ignoring the semantics of the collocations. Semantic classification of collocations has been addressed, for instance, in (Wanner et al., 2006; Gelbukh and Kolesnikova, 2012; Moreno et al., 2013). However, to the best of our knowledge, our work is the first to automatically retrieve and typify collocations simultaneously. We have illustrated our approach with 10 semantic collocation glosses. We believe that this approach is also valid for the coverage of the remaining glosses (Mel'čuk (1996) lists 64 glosses in total in his typology). Distributed vector representations (or word embeddings) (Mikolov et al., 2013c; Mikolov et al., 2013a), which we use, have proven useful in a plethora of NLP tasks, including semantic similarity and relatedness (Huang et al., 2012; Faruqui et al., 2015; Camacho-Collados et al., 2015; Iacobacci et al., 2015), dependency parsing (Duong et al., 2015), and Named Entity Recognition (Tang et al., 2014). We show that they also work for semantic retrieval of collocations. Only a small number of collocations and large unannotated corpora were necessary to perform the experiments. This makes our approach highly scalable and portable. Given the lack of semantically tagged collocation resources for most languages, our work has the potential to become influential in the context of second language learning. The datasets on which we performed the experiments as well as the details concerning the code and its use can be found at http://www.taln.upf.edu/content/resources/765.

Acknowledgements
The present work has been partially funded by the Spanish Ministry of Economy and Competitiveness (MINECO), through a predoctoral grant (BES-2012-057036) in the framework of the project HARenES (FFI2011-30219-C02-02), and by the European Commission under the grant number H2020-645012-RIA. We also acknowledge support from the Maria de Maeztu Excellence Program (MDM-2015-0502). Many thanks to the three anonymous reviewers for insightful comments and suggestions.