On the Distribution of Deep Clausal Embeddings: A Large Cross-linguistic Study

Embedding a clause inside another (“the girl [who likes cars [that run fast]] has arrived”) is a fundamental resource that has been argued to be a key driver of linguistic expressiveness. As such, it plays a central role in fundamental debates on what makes human language unique and how it might have evolved. Empirical evidence on the prevalence and the limits of embeddings has, however, been based on either laboratory setups or corpus data of relatively limited size. We introduce here a collection of large, dependency-parsed written corpora in 17 languages, which allows us, for the first time, to capture clausal embedding through dependency graphs and assess its distribution. Our results indicate that there is no evidence for hard constraints on embedding depth: the tail of depth distributions is heavy. Moreover, although deeply embedded clauses tend to be shorter, suggesting processing load issues, complex sentences with many embeddings do not display a bias towards less deep embeddings. Taken together, the results suggest that deep embeddings are not disfavoured in written language. More generally, our study illustrates how resources and methods from latest-generation big-data NLP can provide new perspectives on fundamental questions in theoretical linguistics.


Introduction
In a prominent intellectual tradition, the infinitude of human expressivity (Humboldt, 1836) is rooted in a machinery that allows syntactic embedding at arbitrary depth (Chomsky, 1957, 1995). Regardless of the controversy around this proposal in terms of computational theory (Pullum and Scholz, 2010; Watamull et al., 2014), it remains an open issue to what extent languages in fact deploy structures with arbitrarily deep embedding. Many languages contain specific constructions that cap embedding depth at the phrasal level at one (e.g. unlike English, Modern Greek compounds don't allow recursive embedding; Ralli, 2013), although more radical constraints (Mithun, 1984; Everett, 2005; Evans and Levinson, 2009) seem to be rare and are avoided when languages evolve over time (Widmer et al., 2017). In terms of sentence production, embedding depths seem to be capped at moderate levels, likely because deeper embeddings place increasing demands on the brain's processing system (Gildea and Temperley, 2010). Corpus studies of English, Pirahã, and a few other (mostly European) languages proposed constraints at one (Reich, 1969; Futrell et al., 2016), two (De Roeck et al., 1982), or three (Karlsson, 2010) levels of embedding. However, given that multiple embeddings might be vanishingly rare, a serious limitation of this work is the size of traditional corpora. If multiple embeddings are subject to constraints from processing load, these constraints are likely to be probabilistic (rather than hard) in nature, and deeper embeddings are expected to be so rare that they are detectable only in very large data sets.
Here, we introduce a collection of large written corpora that we annotated using state-of-the-art parsers trained on Universal Dependencies (UD) treebanks (Nivre et al., 2018). We ask whether there are systematic patterns in the construction of complex nested clauses across languages. Instead of focusing on potential upper bounds on embedding depth, we are interested in the distribution of syntactic dependencies in our corpora. We ask three questions: (i) How does the frequency of embeddings decline with depth? (ii) Is the length of the clauses the same across levels of embedding? (iii) Can the rarity of deep embeddedness be explained by the rarity of longer sentences, or is there a significant preference for simpler structures when sentence length is accounted for? Answers to these questions promise insights into the nature of constraints on the human parser, opening new research avenues on the computational complexity of human language.

Data
Corpora We focus on 17 languages, selected based on data and tool availability. We annotated 2 types of large, publicly available corpora: Wikipedia dumps from March 2017 and, where available, the WMT News Crawl corpora from 2007-2017 (Bei et al., 2018). Table 1 provides basic statistics of the annotated corpora.
Parsing Each corpus was tokenized using UDPipe's (Straka et al., 2016) pre-trained UDv2.2 models (Straka, 2018) and then parsed as follows: We trained Dozat et al. (2017)'s parsing model, a state-of-the-art graph-based neural dependency parser, on the Universal Dependencies 2.2 dataset (Nivre et al., 2018). We used the hyperparameter configuration described in Dozat et al. (2017), and pre-trained FastText word embeddings for frequent words (Bojanowski et al., 2016). We are aiming to make the parsed corpora available as soon as possible.
Measuring embedding depth We approximate embedding relations through dependency graphs. Specifically, for our purposes we define embedding as any dependency such that (i) the dependent is the head of a clause and (ii) permuting head and dependent would lead to an ungrammatical sentence, unlike in "flat" syntactic structures such as coordinated clauses. Any given clause has a depth d, defined as the number of embedding dependencies that need to be traversed in order to reach the root of the sentence from the target clause. For example, the sentence in Figure 1 has a maximum embedding depth of 2, since the clause "that runs fast" is two ACL:RELCL arcs from the root, and no other clause in the sentence lies at a greater distance.
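The depth definition above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the set `EMBEDDING_RELS` is an assumption made for the example (the text defines embedding through a grammaticality criterion, not a fixed list of relations), and the parse of the example sentence from the abstract is written by hand.

```python
# ASSUMPTION: UD relations treated as clausal embedding for this sketch only.
EMBEDDING_RELS = {"acl:relcl", "acl", "advcl", "ccomp", "xcomp", "csubj"}


def max_embedding_depth(heads, deprels):
    """Maximum embedding depth of one sentence.

    heads[i] is the 1-based index of the head of token i + 1 (0 for the root);
    deprels[i] is the relation label of token i + 1.
    """
    def depth(token):
        # Count embedding arcs on the path from this token up to the root.
        d = 0
        while token != 0:
            if deprels[token - 1] in EMBEDDING_RELS:
                d += 1
            token = heads[token - 1]
        return d

    return max(depth(t) for t in range(1, len(heads) + 1))


# Hand-written parse: "The girl who likes cars that run fast has arrived"
heads = [2, 10, 4, 2, 4, 7, 5, 7, 10, 0]
deprels = ["det", "nsubj", "nsubj", "acl:relcl", "obj",
           "nsubj", "acl:relcl", "advmod", "aux", "root"]

print(max_embedding_depth(heads, deprels))
```

For this parse the function returns 2: the path from "run" to the root crosses two relative-clause arcs.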
We do not presently differentiate between center embedding and tail embedding. The difference is ultimately important from a cognitive and computational perspective, but our current interest is focused on the overall distribution of embeddings in large corpora. Knowing this distribution is a prerequisite for modelling the impact of more specific distinctions (Bickel, 2010), such as center vs. tail embedding, the position of the head (verb-final vs. verb-initial), or different types of clausal embedding (e.g. complement clauses vs. chaining).

Results

Maximum Embeddedness Depth
As a first step we explore the tail of the distribution of maximum embeddedness depth in our corpora. We focus on the 1-20 range, for which a majority of the languages in our sample have coverage. The probability distributions are reported in Figure 2. The distributions decay in a continuous fashion rather than showing an abrupt cutoff. An important aspect of characterizing the tail of a distribution is whether it can be approximated by an exponential function ($\Pr(x) \sim \exp(-ax)$, $a > 0$) or whether it has a so-called "long tail" parametrized by a power law ($\Pr(x) \sim x^{-a}$, $a > 0$). One of the essential differences between these types is that long-tailed distributions display a large number of rare events (Khmaladze, 1988) (i.e., in our case, very deep embeddings are attested), in contrast to the exponential regime, where the overwhelming majority of observations are bound within a comparatively narrow range. Statistically distinguishing between these types is not always straightforward, and several alternative distributional models might yield comparable empirical performance (Clauset et al., 2009). It is possible, however, to at least compare the observed data heuristically against reference distributions of each type. For this purpose, we included in Figure 2 two exponential distributions flanking the empirical ones ($\exp(-x/5)$ and $\exp(-x/10)$) and a power-law distribution ($x^{-4/5}$). It can be observed that, while there is a relatively sharp and exponential decrease for the lowest values of embedding depth, the tails of the distributions decay progressively less rapidly, sometimes paralleling the behavior of the reference power-law distribution (see footnote 1).
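The contrast between the two regimes can be made concrete with the reference curves named above. The sketch below is only a heuristic illustration on the unnormalized reference functions, not an analysis of the corpus data; the helper `local_decay` is introduced here for the example. It measures the drop in log-probability between consecutive depths, which is constant for an exponential and shrinks with depth for a power law.

```python
import math


def exp_ref(scale):
    """Unnormalized exponential reference curve exp(-x / scale)."""
    return lambda x: math.exp(-x / scale)


def power_ref(a):
    """Unnormalized power-law reference curve x ** (-a)."""
    return lambda x: x ** (-a)


def local_decay(pr, x):
    """Drop in log-probability between depth x and depth x + 1."""
    return math.log(pr(x)) - math.log(pr(x + 1))


references = {
    "exp(-x/5)": exp_ref(5),
    "exp(-x/10)": exp_ref(10),
    "x^(-4/5)": power_ref(0.8),
}

for name, pr in references.items():
    rates = [round(local_decay(pr, x), 3) for x in (1, 5, 10, 19)]
    print(name, rates)
```

An empirical distribution whose local decay rate keeps falling at larger depths behaves more like the power-law reference than like either exponential.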

Clause Depth and Length
As mentioned in the introduction, it is generally accepted that deeper levels of embeddedness impose a larger burden on the human parser. Given the ample evidence that linguistic behavior involves cost-avoiding strategies (Zipf, 1949), we expect that, all else being equal, clauses of larger d will be shorter, to minimize time spent in states with heavy processing demands. We model clause length (measured in number of orthographic words) with a Poisson regression, with clause depth as the independent variable. The results in Table 2 confirm an across-the-board reduction in mean clause length as a function of depth.
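As a rough illustration of the model family, the following sketch fits a Poisson regression of clause length on clause depth with a log link, using plain Newton-Raphson updates. It is not the software used for Table 2 (which the text does not specify), and the depth and length values are invented toy numbers, not corpus data.

```python
import math


def fit_poisson(depths, lengths, iterations=50):
    """Fit log E[length] = b0 + b1 * depth by Newton-Raphson."""
    b0, b1 = math.log(sum(lengths) / len(lengths)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * x) for x in depths]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(y - m for y, m in zip(lengths, mu))
        g1 = sum((y - m) * x for x, y, m in zip(depths, lengths, mu))
        # Fisher information matrix (2 x 2), inverted in closed form.
        h00 = sum(mu)
        h01 = sum(m * x for x, m in zip(depths, mu))
        h11 = sum(m * x * x for x, m in zip(depths, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# TOY DATA (invented for illustration): clause depth and clause length in words.
toy_depths = [0, 0, 1, 1, 2, 2, 3, 3]
toy_lengths = [12, 10, 9, 8, 7, 6, 5, 5]

intercept, slope = fit_poisson(toy_depths, toy_lengths)
print(f"intercept={intercept:.3f} slope={slope:.3f}")
```

A negative slope means that the expected clause length shrinks multiplicatively, by a factor of exp(slope), with each additional level of embedding.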
Footnote 1: Visual inspection suggests that a small proportion of the deepest embeddings are found in degenerate text, e.g., misprocessed tables. Future work should estimate how such noise affects our statistics.

Large Complex Sentences
Deep embeddings might be rare simply because complex, multi-clause sentences are rare in general.
To assess this possibility, we test whether we can detect a bias against deep embedding when taking sentence complexity (in counts of clauses) into consideration.
For this, we introduce a minimal model for evaluating the presence of a bias against deep embeddings. We focus on complex matrix clauses with a large number of embedded clauses, as those are the ones where such a bias is most likely to be detectable if it exists. In practical terms, we evaluate only main clauses with 8 or more embedding dependencies in total and at least one clause hosting two or more embedding dependencies. We consider the 14 corpora that contained at least 500 sentences satisfying these conditions. We characterize these matrix clauses and their dependents as directed trees τ (with direction from parent to daughter nodes/clauses). Thus, a clause is represented by a node with out-degree equal to the number of embedding dependencies hosted by the clause. The in-degree is 0 for main clauses and 1 for subordinate clauses. The matrix clause is then the root of such a tree, and the leaves are clauses that do not host any embedded clauses themselves.
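The tree representation and the selection criteria can be sketched as follows. The parent-list encoding, the function names, and the two example trees are assumptions made for illustration, and the filter reflects one reading of the criteria (at least 8 embedding dependencies in the whole tree, and at least one clause hosting two or more).

```python
from collections import Counter


def out_degrees(parents):
    """Out-degree of every clause.

    parents[i] is the index of the clause hosting clause i,
    or None for the matrix clause (the root of the tree).
    """
    counts = Counter(p for p in parents if p is not None)
    return [counts.get(i, 0) for i in range(len(parents))]


def qualifies(parents, min_dependencies=8, min_branching=2):
    """Selection filter under the reading described in the lead-in."""
    degrees = out_degrees(parents)
    return sum(degrees) >= min_dependencies and max(degrees) >= min_branching


# Hypothetical clause trees with 9 clauses (8 embedding dependencies) each.
branching_tree = [None, 0, 0, 1, 3, 4, 2, 6, 7]
pure_chain = [None, 0, 1, 2, 3, 4, 5, 6, 7]

print(out_degrees(branching_tree), qualifies(branching_tree))
print(out_degrees(pure_chain), qualifies(pure_chain))
```

The pure chain is excluded because no clause hosts two embedding dependencies: its out-degree sequence admits only one tree shape, so there is nothing to compare it against.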
To evaluate the observed distribution, we generate a baseline set of trees with no bias against deep embeddings. The baseline trees have (i) the same number of clauses (n), and (ii) the same distribution of embedded dependencies (P(k)), i.e. the same out-degree distribution, as the observed trees.
Under the null hypothesis that there is no bias against deep embeddings, the distribution of observed tree depths should be compatible with the distribution of depths arising from the baseline set of trees.
As an illustration, consider the sentence "[The girl [who likes cars [that run fast]] arrived [as I cooked the pasta [that you gave me]]]". In the Universal Dependencies convention, denoting each clause by its own head, this can be represented by the tree in Figure 3(a). This sentence has an embedding depth of 2 and five clauses (nodes) in total, three of which have non-zero out-degree. One possible alternative tree with the same number of nodes and the same out-degree distribution is given in Figure 3(b), which has an embedding depth of 3. Hence, the same number of clauses and the same distribution of embedding dependencies can yield a deeper tree. In the case of large complex sentences with many clauses, there exist multiple such trees, which can be leveraged to determine whether the depth of the empirically observed sentence is unusually low or high given what is expected under the baseline.
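The example can be checked mechanically. In the sketch below, `observed` encodes the clause tree of the example sentence (clauses named by their heads), while `alternative` is one possible rearrangement with the same out-degrees; it is an assumption for illustration and is not necessarily the tree drawn in Figure 3(b).

```python
def tree_depth(parents):
    """Depth of the deepest clause; parents[i] is None for the root."""
    def node_depth(i):
        d = 0
        while parents[i] is not None:
            i = parents[i]
            d += 1
        return d

    return max(node_depth(i) for i in range(len(parents)))


def out_degree_multiset(parents):
    """Sorted list of out-degrees, one entry per clause."""
    return sorted(sum(1 for p in parents if p == i) for i in range(len(parents)))


# Clauses: 0 = arrived (root), 1 = likes, 2 = run, 3 = cooked, 4 = gave
observed = [None, 0, 1, 0, 3]
# ASSUMED rearrangement: a chain of three clauses ending in a clause with two dependents.
alternative = [None, 0, 1, 2, 2]

print(tree_depth(observed), out_degree_multiset(observed))
print(tree_depth(alternative), out_degree_multiset(alternative))
```

Both trees have five nodes and the out-degree multiset [0, 0, 1, 1, 2], but their depths are 2 and 3 respectively.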
It should be stressed that this scheme of comparison considers each observed sentence independently: the statistics of other sentences in the same language play no role. In order to evaluate the overall bias in a language, we compute, for each sentence, the difference between its observed depth and the mean depth of 100 permuted baseline trees derived from it, and we aggregate the results over all sentences within a language. If the resulting distribution of depth differences has its probability mass systematically above or below zero, this would speak against the null hypothesis of no bias.
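A per-sentence comparison of this kind might be sketched as below. The text does not spell out how the permuted trees are generated, so the sampling scheme here is an assumption: the out-degree sequence is shuffled and then rotated (the cycle lemma) so that it reads as the preorder traversal of a valid ordered tree. All function names are introduced for this example.

```python
import random
from statistics import mean


def random_preorder(out_degrees, rng):
    """Random ordered tree with the given out-degrees, as a preorder degree list."""
    seq = list(out_degrees)
    assert sum(seq) == len(seq) - 1, "out-degrees must describe a tree"
    rng.shuffle(seq)
    # Cycle lemma: start right after the first minimum of the running sum of (d - 1).
    total, lowest, start = 0, 0, 0
    for i, d in enumerate(seq):
        total += d - 1
        if total < lowest:
            lowest, start = total, i + 1
    return seq[start:] + seq[:start]


def depth_from_preorder(degrees):
    """Depth of a tree given its out-degrees in preorder."""
    stack, deepest = [], 0  # stack holds the unfinished-children counts of ancestors
    for d in degrees:
        deepest = max(deepest, len(stack))
        if d > 0:
            stack.append(d)
            continue
        # A leaf closes its own subtree and possibly those of its ancestors.
        while stack:
            stack[-1] -= 1
            if stack[-1] > 0:
                break
            stack.pop()
    return deepest


def depth_difference(observed_depth, out_degrees, n_baseline=100, seed=0):
    """Observed depth minus the mean depth of n_baseline baseline trees."""
    rng = random.Random(seed)
    baseline = [depth_from_preorder(random_preorder(out_degrees, rng))
                for _ in range(n_baseline)]
    return observed_depth - mean(baseline)


# Example sentence from the text: observed depth 2, out-degrees of its five clauses.
print(depth_difference(2, [2, 1, 0, 1, 0]))
```

Negative values indicate a sentence that is shallower than its baseline trees, positive values a deeper one; the language-level analysis then aggregates these differences over all qualifying sentences.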
Surprisingly, we find no clear systematic pattern in the comparison. While the median and mean values of the differences vary across languages, the distributions hover around zero with modest variation (so that in general the empirical values lie, on average, 1 SD from the reference sample mean). Figure 4 displays the cumulative distribution of the difference between empirical and mean reference embeddedness depth across languages.

Conclusions
We empirically addressed one central issue in theoretical linguistics, namely the nature and distribution of nested clausal embeddings in natural languages. Large corpora and automated annotation tools are crucial to address this question, as deep embeddings are expected to be rare. Our results indicate that there is no sharp boundary on maximum embedding depth. More deeply embedded clauses appear to be shorter (in number of words), in accordance with the hypothesis that they impose a heavier processing load than shallower clauses. However, surprisingly, when sentence complexity (in number of clauses) is accounted for, there is no clear bias against deeper embeddings. This is a first large-scale quantitative exploration of the matter. In future work, we will extend our set of languages, aiming at more typological variety (Indo-European languages are greatly over-represented in our current data). Moreover, our results rely on automated annotation, and we have no estimate of the extent to which they are affected by annotation error. Finally, we have glossed over potential differences in embedding preferences stemming from differences in types of dependencies (e.g. center vs. tail embedding) and their linearizations (e.g. head-initial vs. head-final), although these differences are likely to play an important role.