Considerations Before Building a Corpus

In previous articles, we talked about natural language processing (NLP) and how it combines linguistics and computer science to create language technologies, such as the spell checkers we use every day when we chat on our cellphones, or apps like WriteBetter, which aims to help writers of all levels polish their pieces through a data-driven model of word suggestion. However, the results you get from any language technology are only as good as the language data that underpins it. In this article, we’ll present some of the considerations to keep in mind when collecting language data for a language app.

What Is a Corpus?

Before we go further, let’s introduce a key concept: the corpus. From now on, whenever we speak of a corpus, we’ll be referring to a collection of language data intended to represent how language is actually used in a given context. That is, it aims to be a model of a language reality, say, that of short stories or physics papers. We must always bear in mind the context of use that we want to represent, because each context has its own peculiarities and conventions of language use. It follows that one of the key issues in corpus building is how to represent a context of use correctly.

Balancing Neutrality and Selectivity

There is no straightforward answer to that issue. What we do have are some principles to consider, particularly those of neutrality and selectivity in language data.

Neutrality means trying to maintain the proportions of all the language aspects that your data may represent, such as forms, tenses, registers, and even the topics addressed. Even if your only aim is to represent a very specific context of use, it is important to choose your selection criteria so as to maximize the neutrality of your data, as this is a recommended practice that helps others reuse and extend it.

That being said, despite our best efforts, neutrality can never be entirely achieved. Moreover, some language features occur rarely but may be important to capture. Thus, neutrality alone won’t do; we must also deliberately select some samples that contain significant language data.

Corpora, as models of language, are built by trying to balance this contradiction. On the one hand, we want to represent language as it occurs in a context of use; on the other, we want to maximize the occurrence of its characteristic aspects.
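To make this balance more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions, not a standard procedure: the function name build_sample, the (register, text) document format, and the two-step strategy are all hypothetical. It first draws documents in proportion to each register's share of the pool (neutrality) and then adds a document for any target term that the draw missed (selectivity).

```python
import random
from collections import Counter, defaultdict


def build_sample(docs, total, target_terms, seed=0):
    """Pick about `total` documents from `docs`, a list of (register, text) pairs.

    Hypothetical helper: the names and the two-step strategy are illustrative.
    """
    rng = random.Random(seed)

    # Group document indices by register (news, fiction, ...).
    by_register = defaultdict(list)
    for index, (register, _text) in enumerate(docs):
        by_register[register].append(index)

    # Neutrality: give each register a quota proportional to its share of the pool.
    chosen = set()
    for indices in by_register.values():
        quota = round(total * len(indices) / len(docs))
        chosen.update(rng.sample(indices, min(quota, len(indices))))

    # Selectivity: if a target term is still missing, add one document that has it.
    for term in target_terms:
        has_term = [i for i, (_reg, text) in enumerate(docs)
                    if term in text.lower().split()]
        if has_term and not chosen.intersection(has_term):
            chosen.add(rng.choice(has_term))

    return [docs[i] for i in sorted(chosen)]


# Made-up pool: 6 news items, 3 fiction items, 1 physics item.
pool = (
    [("news", f"report number {n} on the local election") for n in range(6)]
    + [("fiction", f"chapter {n} of a short story") for n in range(3)]
    + [("physics", "the quark model explains hadron structure")]
)
sample = build_sample(pool, total=4, target_terms=["quark"])
print(Counter(register for register, _text in sample))
```

In this toy pool, the physics register is too small to earn a proportional quota, so the only document containing "quark" enters the sample through the selective step. Note that the final size is approximate, since quotas are rounded and selected documents are added on top.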

In principle, one way to do this is by tuning the size of the corpus. The bigger a corpus is, the better the chances of representing all aspects of language. That is why the sizes of well-known corpora start in the millions of words.

Now, size alone cannot guarantee the occurrence of representative words. This is due to a curious behavior of texts described by George Zipf and later formalized into what is now known as Zipf’s law. It describes a mathematical pattern in the distribution of word occurrences in a text that, broadly speaking, remains constant across all kinds of texts.
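For reference, the law is commonly written as follows. The notation here is ours: r is the rank of a word when words are sorted from most to least frequent, f(r) is its frequency, C is a constant that depends on the text, and s is an exponent usually found to be close to 1.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
```

Under this approximation, the second most frequent word occurs roughly half as often as the first, the third roughly a third as often, and so on.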

Thus, according to this law, if you were to count the frequencies of all words in any given text, you would find that a small number of words accounts for a large share of the occurrences in the text. Conversely, you would find that many words occur only a few times. Usually, the former group consists of function words, whereas the latter is composed of content words. Beyond the mathematical curiosity, this pattern means that, in order to build a corpus properly, you have to carefully select text samples that contain the kinds of words that best represent a context of use, such as specific terms or verbs. Otherwise, you might miss them altogether or include only one or two examples, which is too few to draw conclusions from when analyzing concordances or particular word usage.
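You can check this pattern on your own texts with a few lines of Python. The sketch below is a rough illustration: the tokenizer is deliberately naive, and the helper names (tokenize, coverage, rare_words) and the sample sentence are our own. It counts word frequencies, measures how much of the text the most frequent words cover, and lists the words that occur too rarely to study.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def coverage(counts, top_n):
    """Share of all tokens accounted for by the top_n most frequent words."""
    total = sum(counts.values())
    top = sum(freq for _word, freq in counts.most_common(top_n))
    return top / total


def rare_words(counts, max_freq=2):
    """Words that occur max_freq times or fewer, in alphabetical order."""
    return sorted(word for word, freq in counts.items() if freq <= max_freq)


# Made-up sample; replace it with the contents of your own documents.
text = "The cat sat on the mat and the dog sat on the rug"
counts = Counter(tokenize(text))

print(counts.most_common(3))
print(f"Top 3 words cover {coverage(counts, 3):.0%} of the tokens")
print("Seen only once:", rare_words(counts, max_freq=1))
```

On a real corpus, the rare_words list is the one to watch: if the terms that characterize your context of use show up there, you probably need more samples that contain them.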

There are, of course, many other considerations to keep in mind when building a corpus, most of which have to do with what you intend to use it for. For instance, you might want to build a corpus specialized in a particular author, say, Shakespeare, to computationally compare his writing style with that of others, a line of work known as stylometry. Or you might want to compare the language use of different genres. Or, as is done in data-driven approaches to language learning and teaching, you might want to build a corpus as a source of real language use with which to exemplify particular grammar points, word senses, or even non-standard uses.
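As an illustration of that last use, a common way to pull examples of real usage out of a corpus is a keyword-in-context (KWIC) concordance. The following Python sketch is a simplified version under our own assumptions: the function name kwic, the character-based context window, and the sample sentence are all illustrative.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword: left context, keyword, right context."""
    # Collapse line breaks and repeated spaces so contexts read as running text.
    text = " ".join(text.split())
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)

    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Made-up sample showing two senses of the same word.
sample_text = "The bank raised its rates. We sat on the river bank. Online banking is popular."
for line in kwic(sample_text, "bank"):
    print(line)
```

Reading the aligned lines side by side makes it easier to spot different senses of a word, here the financial and the riverside "bank", while "banking" is left out because only whole-word matches are kept.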

It was this last point that drove us to develop WriteBetter. Because we wanted to create a tool that could help all kinds of writers improve their written pieces, we set out to integrate over 60 GB of corpora, representing a variety of contexts of use and domains of knowledge, into an app that fits seamlessly into the writing process.

Linguistic search