The importance of corpora (language databases) for language technologies, as mentioned in previous articles, can’t be overstated. Corpora are effectively the foundation of all language technologies. Thus, the better the corpora are designed and built, the better the technologies will work.
In a previous article, we talked about what to consider before setting out to collect language samples for a corpus. In brief, a corpus's purpose is to represent a context of language use; it can be thought of as a model of that context. Thus, when selecting the language samples that will constitute our model, we should always try to balance neutrality (minimizing bias) and selectivity (actually capturing representative language uses).
That being said, in the present article we will move away from theory a little and focus on a non-exhaustive classification of the corpora you can find out there.