Exploring the variety of corpora

The importance of corpora (language databases) for language technologies, as mentioned in previous articles, can’t be overstated. Corpora are effectively the foundations of all language technologies. Thus, the better the corpora are designed and built, the better the technologies will work.

In a previous article, we talked about the considerations to keep in mind before setting out to collect language samples for a corpus. In brief, a corpus’s purpose is to represent a context of language use; it can be seen as a model of that context. Thus, when selecting the language samples that will constitute our model, we should always try to balance neutrality (minimizing bias) and selectivity (actually capturing representative language uses).

That being said, in the present article we will move away from theory a little bit, focusing on a non-exhaustive classification of the corpora you can find out there.

Considerations Before Building a Corpus

In previous articles, we have talked about natural language processing (NLP) and how it combines linguistics and computer science to create different language technologies, such as the spell checkers we use every day when we chat on our cellphones, or apps like WriteBetter, which aim to help writers of all levels polish their pieces through a data-driven model of word suggestion. However, the results you get from any language technology are only as good as the language data that underpins it. Thus, in this article, we’ll present some of the considerations to keep in mind when collecting language data for a language app.

Concordancers: taking a peek into the linguistic context of a word

Natural language processing (NLP), the branch of computer science concerned with transforming human language (as opposed to formal languages, such as math or computer code) into something readable and understandable by a machine, has generated many interesting applications that are commonplace today: spell correction, word prediction, search engines, automatic translation apps (a.k.a. machine translation), and chatbots, among others.

Even though these apps can be useful for the language learner (who hasn’t googled a word?), they are designed with the general digital user in mind. However, there are some apps that, while using similar NLP technology, are more directly concerned with language and can be a better aid for writers wanting to broaden their vocabulary and polish their style. In this article, I’ll tell you about one particular language technology that I believe is both very simple and tremendously useful for language learning: concordancers.