How to represent a word and predict it, too: Improving tied architectures for language modelling

Recent state-of-the-art neural language models share the representations of words given by the input and output mappings. We propose a simple modification to these architectures that decouples the hidden state from the word embedding prediction. Our architecture leads to comparable or better results compared to previous tied models and models without tying, with a much smaller number of parameters. We also extend our proposal to word2vec models, showing that tying is appropriate for general word prediction tasks.


Introduction
In neural models, reusing representations of the same type of data (e.g., sentences or words) in different parts of the architecture can be a powerful way to aid learning: it reduces parameters, enabling more compact models and faster learning. Recurrent neural network (RNN) language models (Mikolov et al., 2010; Sundermeyer et al., 2012; Zaremba et al., 2014) have two word mappings: from the input word onto its embedding representation, and from the internal representation of the network (the hidden layer) to the weights for the prediction of the next word. In standard models, these representations are different. Recently, Inan et al. (2017) and Press and Wolf (2017) proposed to instead use a single word representation, tying the input and output mappings. Intuitively, both are representations of the same type of data (words), and information learnt when observing a word as input can be reused when predicting this word as output. Tied language models obtained better perplexity and better word similarity scores of embedding matrices while reducing the number of parameters. The models that achieve the latest state-of-the-art results incorporate this technique (see, e.g., Merity et al., 2018).
However, note that, by tying the output mapping to the input mapping, the hidden layer of the network is optimised to match the representation of the predicted word. We suggest that this introduces a constraint that conflicts with the function of the hidden layer in language models to represent the previous context and transmit information to the next timestep. In this paper, we propose a minimal modification to tied LM architectures to address this issue: we add a linear transformation between the hidden layer and the word embedding prediction, partially decoupling the two. This has an important advantage. Standard tied architectures require the hidden layer to have the same dimensionality as the word embeddings. We lift this constraint in our architecture: by separating the hidden layer and the word mapping, we can choose a large hidden layer dimensionality while keeping the embedding dimensionality and, consequently, the size of the embedding matrix small.
In a set of experiments on LM, we show that our tied models achieve results similar to or better than models with standard or no tying, with much smaller embedding sizes and a reduction of 30-60% in the overall number of parameters. Notably, the word embeddings obtained with the modified model have a higher quality than double-sized embeddings obtained with standard tied models, as measured on word similarity.
We further extend this idea to word representation learning models (in particular, word2vec), which have a similar architecture and objective function to language models. For instance, the standard skipgram model (Mikolov et al., 2013) has two mappings, one for the context words and one for the target word. While tying these two matrices directly constrains learning too strongly, an additional linear mapping adds sufficient capacity to learn embeddings of the same quality as the standard model using only half the parameters.

Tied LM architectures

Previous work
Equations (1) through (4) define a standard RNN language model (Mikolov et al., 2010):

\[ \hat{x}_t = E^\top x_t \tag{1} \]
\[ h_t = \mathrm{LSTM}(\hat{x}_t, h_{t-1}) \tag{2} \]
\[ o_t = W h_t \tag{3} \]
\[ p_t = \mathrm{softmax}(o_t) \tag{4} \]

where $x_t$ is the one-hot encoding of the input word at time $t$, $\hat{x}_t$ is its embedded representation and $E$ is the embedding matrix (input mapping). We follow the majority of previous work in adopting an LSTM (Hochreiter and Schmidhuber, 1997) as the recurrent unit, as it was shown to outperform other recurrent architectures for LM (Jozefowicz et al., 2015). The hidden vector $h_t$ is used as input to the LSTM for the next time step and also as input to a linear transformation $W$ (output mapping), which produces the output weights. These weights are normalised into probability scores using the softmax function.

The tied models proposed by Inan et al. (2017) and Press and Wolf (2017) set $W$ to be equal to $E$ (Figure 1b). Note that since $E$ is of size $|V| \times m$, where $m$ is the embedding size, and $W$ is of size $|V| \times n$, where $n$ is the hidden state size, the hidden and embedding dimensions must be equal, that is, $m = n$. Press and Wolf (2017) observe that the output matrix $W$ represents an embedding matrix, since two similar words with indices $i$ and $j$ are learnt to receive similar probabilities given a context and hence the rows $W_i$ and $W_j$ should also be similar. Indeed, they show that on some word similarity evaluation tasks the output matrix $W$ significantly outperforms the input matrix $E$. Tying $E$ and $W$ makes the model share the representations for the input and output vocabularies.
Inan et al. (2017) suggest a theoretical motivation for the tying technique and derive it as an instance of a more general approach of augmenting the cross-entropy loss. They show that a loss that takes into account not only the target word (i.e., $\log p_i$ for cross-entropy) but also the scores for all words in the vocabulary according to their similarity to the target (computed as the dot product of embeddings in $E$) improves performance on LM. One important practical advantage of tying input and output matrices is the reduction in the number of parameters with respect to a standard model with the same hidden and embedding dimensions, since instead of two matrices $E$ and $W$ of size $|V| \times m$ we have only one matrix of size $|V| \times m$.

Proposed modification
One potential problem with the tied model, where $E = W$, is that the hidden state $h_t$ is optimised to be close to the embedding of the target word. To see this, consider that $o_j = e_j \cdot h_t$, $\forall j$. The cross-entropy objective is to maximise $\log p_i$ for the target word with index $i$ and consequently to minimise $\log p_j$, $\forall j \neq i$. Hence, $o_i = e_i \cdot h_t$ will be increased by the gradient descent update and $h_t$ will be aligned more closely with $e_i$ (for each dimension $k$ of the two vectors, $h_{tk} \cdot e_{ik}$ will be increased). This association between $h_t$ and the word embedding space could prevent efficient retention of LSTM history in $h_t$, which is used as input for the following time step.
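The alignment argument can be made explicit with a short derivation. The following is a sketch that assumes the tied output layer has no bias term; the learning rate $\eta$ is a symbol introduced here only for illustration.

```latex
\begin{align*}
\log p_i &= o_i - \log \sum_{j} \exp(o_j), \qquad o_j = e_j \cdot h_t \\
\frac{\partial \log p_i}{\partial h_t} &= e_i - \sum_{j} p_j \, e_j \\
h_t + \eta \, \frac{\partial \log p_i}{\partial h_t} &= h_t + \eta \Big( e_i - \sum_{j} p_j \, e_j \Big)
\end{align*}
```

Under these assumptions, the gradient signal reaching $h_t$ points towards the target embedding $e_i$ and away from the probability-weighted average of all embeddings, which is the sense in which $h_t$ is pulled into the word embedding space.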
To address this issue, we propose a simple modification to the standard tied model, replacing the output transformation (3) as follows:

\[ \hat{h}_t = L h_t, \qquad o_t = E \hat{h}_t \tag{5} \]

Here, $L$ is an additional linear transformation that decouples the hidden state $h_t$, which is passed to the next time step and represents the previous, possibly long-term, linguistic context, from $\hat{h}_t$, which is instead optimised to match the embedding of the output word. As Figure 1 illustrates, an important advantage of the additional transformation $L$ is that a model can have different dimensions for the hidden vector ($n$) and the embedding vectors ($m$). The embedding matrix is the largest part of the model when the vocabulary is large ($|V| \gg n$). Reducing the embedding size $m$ leads to a significant reduction in the number of parameters, proportional to $|V|$, and to faster softmax computation. On the other hand, the size of the additional matrix $L$ is only $n \times m$ and contributes very little to the overall size of the model. We test empirically how reducing the embedding size affects the performance of language models by varying hidden and embedding sizes in our experiments and evaluating embedding matrices on word similarity tasks.
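The output step of the modified model can be illustrated with a small, dependency-free sketch. It is not the implementation used in the experiments: the function names, the toy matrices and the sizes (4 words, 2 embedding dimensions, 3 hidden dimensions) are all hypothetical, and the hidden state is simply given rather than produced by an LSTM.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tied_l_output(E, L, h):
    """Tied+L output step: h_hat = L h, then o = E h_hat, then softmax."""
    h_hat = matvec(L, h)   # project the n-dim hidden state down to m dims
    o = matvec(E, h_hat)   # score every word with the shared embedding matrix E
    return softmax(o)


# Hypothetical toy sizes: |V| = 4 words, m = 2 embedding dims, n = 3 hidden dims.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # |V| rows of m entries
L = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]                 # stored as m rows of n entries
h = [1.0, 2.0, 1.0]                                    # hidden state, given here

p = tied_l_output(E, L, h)
print(p.index(max(p)))  # index of the most probable word
```

Note that only `h_hat` is compared with the rows of `E`; the hidden state `h` itself can have any dimensionality, which is what allows the hidden and embedding sizes to differ.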
Commonly used language models often have two layers of LSTM cells. Thus, the issue we identified might be mitigated in practice, since the hidden state of the first layer is not directly affected by tying the output and input transformation matrices. Moreover, an LSTM cell carries over information both through the hidden state and a memory state; the latter is affected by tying only indirectly (see Hochreiter and Schmidhuber (1997) for details on LSTM architectures). However, our experiments show that, in practice, two-layer LSTM LMs are still affected by tying despite these caveats.

Extending the tied technique to word2vec
The tied technique, as formulated above, can in principle be applied to any model that has the same general objective as LM: predicting a target word given context words. The CBOW word2vec model for word representation learning (Mikolov et al., 2013) is the primary candidate for testing the applicability of the tied technique beyond LM, since we can see it as substituting the LSTM function (2) with a simple sum of context word embeddings, $h = \sum_i \hat{x}_i$ (where $x_i$ are words in the context window, e.g., of size 5). Similarly then, the equations in (5) describe the tied version of the CBOW model. The linear step here provides the capacity to learn a transformation from the sum of embeddings to the predicted embedding $\hat{h}$. Without such a transformation, the tied model would assume that the sum function is always a good approximation of the output embedding. The skipgram word2vec model (Mikolov et al., 2013) employs a variant of the LM objective: it is trained to predict context words given a word, instead of the opposite. As in CBOW and neural language models, words are both inputs and targets, making tying an option for this architecture as well. Press and Wolf (2017) apply direct tying to this architecture and report that the quality of the obtained embeddings is below the quality of non-tied skipgram embeddings. Unlike CBOW or LSTM, the input-to-hidden state function of the skipgram model is the identity, reducing the tied model objective to $\hat{x}_i = \hat{x}_j$ for every pair of input-output words $i, j$. It is thus not surprising that enforcing the tying constraint leads to poor empirical results. We test whether adding a linear transformation also improves the performance of the tied technique for a skipgram model.
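The tied CBOW variant with a linear step can be sketched in the same spirit. This is an illustrative toy, not the word2vec implementation: the function name, the 3-word vocabulary and both example matrices are hypothetical. Passing the identity matrix as `L` corresponds to direct tying, where the sum of context embeddings is itself compared with the rows of `E`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tied_cbow_probs(E, L, context_ids):
    """Tied CBOW with a linear step: sum context rows of E, map through L, score with E."""
    m = len(E[0])
    # h: sum of the embeddings of the context words
    h = [sum(E[i][k] for i in context_ids) for k in range(m)]
    # h_hat = L h: learned map from the summed context to the predicted embedding
    h_hat = [sum(w * v for w, v in zip(row, h)) for row in L]
    # o = E h_hat: the same matrix E is reused as the output mapping
    o = [sum(e * v for e, v in zip(row, h_hat)) for row in E]
    return softmax(o)


# Hypothetical toy vocabulary of 3 words with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # direct tying: h_hat equals h
L = [[1.0, 0.0], [0.0, -1.0]]         # an arbitrary example of a non-identity map
context = [0, 1]

direct = tied_cbow_probs(E, identity, context)
with_l = tied_cbow_probs(E, L, context)
print(direct.index(max(direct)), with_l.index(max(with_l)))
```

With direct tying the prediction is determined entirely by the sum of the context embeddings, whereas the extra matrix lets the same context map to a different region of the embedding space.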

Evaluation data
We use two corpora for the evaluation of language models. First, we employ a medium-sized corpus of approximately 100M tokens with a relatively large vocabulary, 50K words, created from a Wikipedia dump (henceforth, Wiki). To allow comparison with previous work, we also evaluate on the Penn Treebank (PTB), which is small but has been used as a benchmark for LM since Mikolov et al. (2011). The PTB has approximately 1M tokens and is preprocessed to have 10K vocabulary words; we use the standard train-validation-test split.
Furthermore, to evaluate the quality of the embeddings induced by the language models, as well as for the word representation experiments in Section 3.4, we use three standard word similarity datasets: SimLex-999 (Hill et al., 2015; SimLex), MEN (Bruni et al., 2014), and RareWords (Luong et al., 2013; RW). The performance on these datasets is evaluated in terms of Spearman correlation.
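As an illustration of this kind of evaluation, the sketch below computes a Spearman correlation between model similarities and human ratings. It assumes, as is common practice but not stated above, that model similarity is the cosine between two word embeddings; the embeddings, word pairs and ratings are invented toy values, not data from SimLex, MEN or RW.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Invented toy embeddings and human ratings, for illustration only.
embeddings = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.0, 1.0], "truck": [0.3, 0.9]}
pairs = [("cat", "dog", 9.0), ("car", "truck", 8.5), ("cat", "car", 1.5)]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```

Because only ranks matter, the measure is insensitive to the scale of the similarity scores, which makes embeddings of different sizes directly comparable.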

Training setup
As our base language models, we adopt the ones proposed in Zaremba et al. (2014). We use 2-layer LSTMs with dropout applied to the input embedding, to the output of the first LSTM layer and to the output of the second layer. We use the PyTorch implementation and modify it to include the additional linear layer for our tied models. We report the best model after the hyperparameter search for dropout and learning rate (see the details in Appendix A).

Language modelling results
We present the LM results for the standard non-tied model, the tied model as in Inan et al. (2017) and Press and Wolf (2017), and our tied model with an additional linear transformation (tied+L) in Tables 1 (PTB) and 2 (Wiki). Table 1 confirms that tying generally brings gains with respect to not tying. This is also true for the cases when the hidden and embedding sizes are different (e.g., 400/200 and 600/400), where our tied+L model outperforms the non-tied model by 5 to 6.4 points while having around 40% fewer parameters. Furthermore, our decoupled model slightly but consistently improves results with respect to standard tying, confirming our intuition that the coupling of the hidden state to the embedding representation is a limiting constraint. Smaller tied+L models perform well compared to larger tied models. In particular, the tied+L model with 600/400 units has a perplexity of 76.0, compared to 76.1 for the tied 600/600 model, with 55% of the number of parameters. Note that our results are comparable to previously reported perplexity values on PTB for similar models. Our best result of 75.5 test perplexity is only 1.2 points behind the large tied model with 1500 units reported in Press and Wolf (2017) and only 1.6 points behind the medium tied model with 650 units and variational dropout (Gal and Ghahramani, 2016) reported in Inan et al. (2017).
On the Wiki corpus with larger vocabulary (Table 2), we find that tied models achieve slightly lower perplexity than non-tied models with half the number of parameters, and our proposed tied+L model achieves lower perplexity than the tied model. The most relevant result of the present experiment, however, is that the tied+L model with 300 embedding units is actually better than the tied model with 600 units (38.5 vs 39.7 points; the tied+L model has 20M parameters compared to 36M for the tied model). That is, a smaller model outperforms a larger model. Thus, our decoupling mechanism not only allows models to have better perplexity, but also yields more compact word embeddings, which are of higher quality as measured on word similarity: .42/.61/.68 for the tied+L embeddings of size 300, compared to .39/.55/.64 for the tied embeddings of size 600.

Experiments on word2vec models
Table 3 presents the evaluation of word2vec models on the three word similarity datasets. We ran the experiments only on the Wiki corpus due to its higher coverage (50K vocabulary), and used embeddings of size 300. Our results on CBOW show that the tied+L architecture obtains comparable results to the non-tied architecture with almost half the parameters (15.1M vs 30M). This confirms that tying with an additional linear transformation is appropriate not only for language models but for word learning models more generally. The skipgram algorithm shows a small degradation of performance for the tied+L architecture with respect to the non-tied one; note that, as explained in Section 2.3, tying makes the most sense for CBOW. However, the fact that standard tying obtains much worse results (similarly to the results of Press and Wolf, 2017) shows that the linear mapping substantially relaxes the tying constraint.
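The reported parameter counts for CBOW can be approximately reproduced with simple arithmetic. The sketch below assumes a vocabulary of exactly 50,000 words, a square 300 × 300 matrix for the linear transformation and no bias terms; these are assumptions made for illustration, and the function name is hypothetical.

```python
def cbow_params(vocab_size, emb_size, tied_with_linear):
    """Count weights in the word mappings, ignoring any bias terms (an assumption)."""
    embedding = vocab_size * emb_size
    if tied_with_linear:
        # one shared embedding matrix plus the emb_size x emb_size matrix L
        return embedding + emb_size * emb_size
    # separate input and output matrices
    return 2 * embedding


non_tied = cbow_params(50_000, 300, tied_with_linear=False)
tied_l = cbow_params(50_000, 300, tied_with_linear=True)
print(non_tied, tied_l)
```

Under these assumptions the non-tied model has 30,000,000 weights and the tied+L model 15,090,000, in line with the 30M and 15.1M figures above; the linear transformation accounts for only 90,000 of them.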

Conclusions
Overall, our simple modification to tied language modelling architectures generalises previous work by allowing tying without imposing constraints on the number of hidden and embedding dimensions. This leads to flexible architectures with a more efficient use of both hidden states and embeddings. For word representation learning models, having an additional linear transformation reduces the number of parameters while maintaining learning capacity. In general, reducing model size without harming performance is a desirable feature in practice, for example in the case of language models running on mobile devices, and it is also desirable on theoretical grounds, since it is a better use of the learning capacity of neural networks.