Exploring the variety of corpora

The importance of corpora (language databases) for language technologies, as mentioned in previous articles, can’t be overstated. Corpora are effectively the foundations of all language technologies. Thus, the better the corpora are designed and built, the better the technologies will work.

In a previous article, we talked about the considerations one should have before setting out to collect language samples for a corpus. In brief, a corpus’s purpose is to represent a context of language use; it can be seen as a model of that context. Thus, when selecting the language samples that will constitute our model, we should always try to balance neutrality (minimizing bias) and selectivity (actually capturing representative language uses).

That being said, in the present article we will move away from theory a little bit, focusing on a non-exhaustive classification of the corpora you can find out there.

According to the source of the language samples

A first criterion for classifying corpora is the source or origin of the language samples they contain. Thus, we can distinguish between textual and spoken corpora.

Textual corpora are basically collections of language samples coming from written documents, be they news articles, books, etc., whereas spoken corpora are collections of transcriptions of spoken language recordings. Commonly, transcriptions are orthographic, but it is possible to find cases of phonetic (or even phonological) transcriptions. If they include the original audio recordings, they become speech corpora, which are commonly used to train speech recognition systems (e.g. the one behind Siri).

According to text type distribution

We have mentioned that an ideal corpus reflects in an unbiased and proportional way the reality of language use in a given context. Well, that is not necessarily always the case (it’s the ideal, though). So, we can further classify corpora according to the way in which they manage the distribution of the varieties of language samples they contain.

Although the term is ambiguous, we can speak of a big corpus as one that has neither an established size limit nor specifications about the distribution of different text types. In such a corpus, the focus is on volume, not on careful design.

Conversely, a balanced corpus entails great care in the proportional distribution of the different text types within it.
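To make the idea of proportional distribution concrete, here is a minimal sketch of how one might draw a balanced sample from a larger pool of documents. The function name `sample_balanced`, the text type labels, and the target shares are all hypothetical; real balanced corpora define their proportions through careful design decisions, not a fixed formula.

```python
import random
from collections import defaultdict


def sample_balanced(documents, target_shares, total, seed=0):
    """Draw `total` documents so each text type gets its target share.

    documents: list of (text_type, text) pairs.
    target_shares: dict mapping text_type -> proportion (expected to sum to 1).
    """
    rng = random.Random(seed)

    # Group the available texts by their text type.
    by_type = defaultdict(list)
    for text_type, text in documents:
        by_type[text_type].append(text)

    corpus = []
    for text_type, share in target_shares.items():
        quota = round(total * share)
        pool = by_type[text_type]
        if len(pool) < quota:
            raise ValueError(
                f"not enough '{text_type}' samples: need {quota}, have {len(pool)}"
            )
        # Sample without replacement so no document is included twice.
        corpus.extend((text_type, text) for text in rng.sample(pool, quota))
    return corpus
```

A big corpus, by contrast, would simply keep every document in the pool, whatever the resulting proportions turn out to be.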

In another vein, we can also find what are known as monitor corpora, which keep a relatively constant volume of language samples but are dynamic: older samples are constantly replaced with newer ones. As their name implies, corpora of this kind are used to monitor the reality of language in real time.
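The behaviour of a monitor corpus, constant size with continuous turnover, can be pictured as a fixed-capacity queue. The sketch below is only an illustration: the class name `MonitorCorpus` and its methods are hypothetical, and an actual monitor corpus would involve far more than dropping the oldest sample.

```python
from collections import deque


class MonitorCorpus:
    """Fixed-capacity corpus: once full, adding a sample evicts the oldest one."""

    def __init__(self, capacity):
        # A deque with maxlen discards items from the left when it overflows.
        self._samples = deque(maxlen=capacity)

    def add(self, sample):
        self._samples.append(sample)

    def samples(self):
        # Oldest sample first, newest last.
        return list(self._samples)

    def __len__(self):
        return len(self._samples)
```

With a capacity of three, adding a fourth sample leaves the size unchanged and silently removes the first one added.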

Finally, there are also parallel corpora, which I am sure most of us have used at some point. Parallel corpora are collections of documents and their translations into one or more languages. Usually, terms in the different versions of a document are mapped to each other, so as to help translators in their terminological research and to train machine translation systems such as Google Translate.

According to their specificity

We can further classify corpora according to the specificity of the context of use they aim to model.

In this line of thought, we understand a specialized corpus as a corpus that aims to model the language use of a specific knowledge area (e.g. Biology) or of particular language uses (e.g. writings in verse). Specialized corpora are tremendously useful for lexicography, in that they allow researchers to identify emerging terms and language uses.

On the other hand, a generic corpus is one that tries to represent how language is used in a particular writing genre. Generic corpora are becoming increasingly popular in literary studies research, as they provide scholars with statistical tools to pinpoint characteristic features. Canonical corpora, which cover all the written works of a specific author, are similarly used in literary studies and applied linguistics research, particularly in an area known as stylistics, to identify linguistic features, patterns, turns of phrase, etc. that can be said to characterize the style of a given author. An interesting application of such an empirical model of an author’s style is known as authorship attribution. It basically involves comparing the style of a document whose author is unknown to previously constructed models of known authors.
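As a rough illustration of authorship attribution, the sketch below models an author’s style as the relative frequencies of a few function words and picks the known author whose profile is closest, by cosine similarity, to that of the unknown document. This is a deliberately simplified assumption, not a description of any particular stylistics tool: the word list `FUNCTION_WORDS` and the function names are hypothetical, and real studies rely on much richer features and far more text.

```python
import math
import re
from collections import Counter

# Hypothetical, very small feature set; real studies use hundreds of features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in",
                  "that", "it", "is", "was", "but", "not")


def style_profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def attribute(unknown_text, known_texts):
    """Return the author whose style profile is most similar to the unknown text.

    known_texts: dict mapping author name -> text written by that author.
    """
    target = style_profile(unknown_text)
    return max(
        known_texts,
        key=lambda author: cosine_similarity(target, style_profile(known_texts[author])),
    )
```

In practice, `known_texts` would hold each author’s canonical corpus, and the result would be treated as evidence to weigh, not as a verdict.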

Finally, we also find general corpora, which intend to represent language use in as wide a context as possible. In other words, the idea of general corpora is to not focus on particular genres, fields of knowledge, etc., but to represent them all to some extent.

As a side note, there is a kind of corpus that focuses on foreign language learners’ use of their L2, known as a learner corpus. Learner corpora provide evidence of systematic errors, L1 interference, etc., and thus offer an empirical basis for characterizing the different developmental stages in language learning, for error analysis, for studying communicative strategies, and potentially so much more.

As a closing remark, we must always bear in mind that a corpus’s design responds to a particular purpose and represents a particular context of use (however general or specific). This does not mean that a corpus cannot be reused and extended well beyond its original design. Indeed, the corpus linguistics community thrives on reusing ready-made corpora.

Let’s take our case, for example. WriteBetter feeds on over 60 GB of corpora of different kinds. Though there were certain problems that we could avoid (language, for instance, as we only deal with corpora in English), there were very particular challenges as well. Since our purpose was to create a tool that could help writers of all proficiencies write better in whatever genre or domain of knowledge they set out to work on, we had to consider both general and specialized corpora. The challenge, thus, was to integrate them in such a way as to minimize overlap and bias, while accounting for as many language uses as possible.
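One simple way to picture the overlap part of that challenge is exact-duplicate removal when merging several corpora. The sketch below is purely illustrative and is not a description of how WriteBetter integrates its corpora: the function names are hypothetical, and it only catches texts that are identical after whitespace and case normalization, not near-duplicates or partial overlaps.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase, so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def merge_without_duplicates(*corpora):
    """Merge corpora (iterables of texts), keeping the first copy of each text."""
    seen = set()
    merged = []
    for corpus in corpora:
        for text in corpus:
            # Hash the normalized text so the set stays small for long documents.
            digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                merged.append(text)
    return merged
```

Reducing bias is a separate matter that deduplication alone does not address; it depends on how the text types are weighted once the corpora are merged.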

Linguistic search