WriteBetter: A tool based on Corpora and Natural Language Processing

Are you curious about how WriteBetter works?
In this article, I’ll briefly explain the theory and technology behind WriteBetter and compare it with similar tools.

What is WriteBetter?

It is a tool fed by massive amounts of real written language samples stored in repositories over which a series of algorithms are run to clean and organize the language for future retrieval, analysis, and visualization.

Quite a mouthful, I know. But there are actually only three key points to take from this:

  1. It’s based on huge collections of real written language samples.
  2. These samples have to be computationally processed to be useful.
  3. The result of the previous steps has to be visualized nicely.

What is the theory behind WriteBetter?

The theory behind the first point is known as corpus linguistics. Maybe you’ve already heard of it, but for those of you who haven’t: it is a branch of linguistics that aims to draw conclusions from representative samples of real language use in any given context. And when it comes to samples, the more, the better.

Why are samples useful?

Well, if we want to really know how language behaves in a given context, say, in the news, we ought to look at and learn from the data. In this way, we can see whether there’s any tendency to use certain words, structures, etc. over others and act accordingly (e.g. try to replicate that in our own written pieces).
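To make the idea of spotting tendencies concrete, here is a toy sketch in Python. The three “news” sentences are invented for illustration (a real corpus holds millions of samples), and the tokenization is deliberately naive:

```python
from collections import Counter

# A toy "corpus" of news-style sentences, invented for illustration.
corpus = [
    "The government announced a new policy on Tuesday.",
    "Officials announced the policy would take effect in May.",
    "The new policy was announced amid growing criticism.",
]

# Lowercase and split on whitespace, stripping trailing punctuation --
# a deliberately naive way to get "words".
words = [w.strip(".,").lower() for text in corpus for w in text.split()]

# Count how often each word appears, revealing tendencies in usage.
freq = Counter(words)
print(freq.most_common(5))
```

Even on three sentences, the counts show that this tiny corpus leans on “policy” and “announced”; scaled up to millions of samples, the same counting reveals the genuine preferences of a genre.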

However, not just any language use can count as a sample. One has to define beforehand certain criteria for what will count as a sample. Common considerations are which geographical variants we want to include, which genres, how we are going to balance the diversity of our samples, and so on.

Then, these samples are basically thrown together into huge language repositories, each known as a corpus (a fancy Latin word that means “body” and whose plural is corpora).

The second point extends from this. Once we have our corpus, we can probably read a few language samples, say, 50, 100 or even more, and draw conclusions. But when we talk about millions, that is simply not feasible. That’s why we use computers to do the job for us. Indeed, it was due to this very fact that corpus linguistics was one of the first branches of linguistics to readily open up to the use of new technologies.

When we say “millions of samples” we do mean it. The Intelligent Web-based Corpus, one of the most widely used corpora out there, consists of 14 billion words. Its main drawback, though, is that it doesn’t limit its sample-taking by any thorough criteria; it treats anything put on the Web as a valid language sample. A better example could be the Corpus of Contemporary American English (COCA), with its 560 million words and the plus of well-thought-out, carefully applied sampling criteria.

That being said, let’s come back to what I was saying about using computers, particularly to something called natural language processing (NLP).

What is Natural Language Processing (NLP)?

NLP is a branch of computer science that deals with the problem of allowing computers to read and operate over natural language (as opposed to formal languages, such as programming languages, or structured data, like spreadsheets). It involves a series of steps to transform raw text (literally, any piece of writing you might have in a text or word file) into something that a machine can read and “understand”.

A first step, for instance, could be to split a text into words. Here one might ask: but what exactly is a word? An easy approach could be to say that anything separated by blank spaces is a word. But then, what happens with acronyms? Or hyphens? It is not as straightforward to define as it may seem.
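The difference is easy to see in a few lines of Python. The sentence below is made up for illustration; it contrasts the naive blank-space approach with a slightly smarter regular expression (one possible heuristic among many, not a complete tokenizer):

```python
import re

text = "The U.S.-based start-up raised $2M (approx.) in mid-2020."

# Naive approach: anything separated by blank spaces is a "word".
# Punctuation stays glued on: '(approx.)', 'mid-2020.'
naive = text.split()

# A slightly smarter heuristic: keep acronyms like "U.S." together
# (runs of letter+dot), keep hyphenated words like "start-up" whole,
# and drop surrounding punctuation. Still far from perfect.
pattern = r"(?:[A-Za-z]\.){2,}|\w+(?:[-']\w+)*"
tokens = re.findall(pattern, text)

print(naive)
print(tokens)
```

Neither answer is “the” right one: whether “U.S.-based” should be one token or two is exactly the kind of decision an NLP pipeline has to make explicitly.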

Another common task is to teach the computer to differentiate between grammatical categories, so that you may search for grammatical patterns rather than exact words. 
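As a sketch of what searching by grammatical pattern looks like, here is a hand-tagged example in Python. In practice a trained tagger assigns the part-of-speech labels automatically; the tags below (Penn Treebank-style: DT, JJ, NN, VBD) are written by hand purely for illustration:

```python
# A pre-tagged sentence: (word, part-of-speech) pairs. In a real system
# a tagger produces these labels; here they are hand-written examples.
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
    ("made", "VBD"), ("a", "DT"), ("bold", "JJ"), ("move", "NN"),
]

# Find every adjective + noun pair: a grammatical pattern,
# not a search for exact words.
matches = [
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if t1 == "JJ" and t2 == "NN"
]
print(matches)  # [('quick', 'fox'), ('bold', 'move')]
```

The same query would match “strong argument” or “new policy” in any other tagged text, which is precisely what makes pattern search more powerful than exact-word search.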

Among the things that can be built by combining corpora and NLP are the ever-infamous spell checkers, search engines like Google, and tools like COCA and WriteBetter.

What is COCA?

The Corpus of Contemporary American English (COCA) is both a corpus and a web application for querying it. If you’re curious, it’s readily available for you to check here. You’ll see that it has a very simple interface, where you can query for the occurrences of a specific word, or even check the words it tends to appear alongside (called collocates). Altogether, COCA is quite a useful tool if you are interested in doing corpus research, but it has its limitations. Particularly important, I think, are the facts that there’s a limit on the number of queries you can make in any given day, that the visualization is not so friendly, and that you cannot integrate it with your writing apps. Indeed, COCA is definitely oriented more towards researchers than writers.
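The idea of collocates is simple enough to sketch in a few lines of Python. The sentences are invented for illustration, and the window size of two words on each side is an arbitrary choice (corpus tools typically let you configure it):

```python
from collections import Counter

# Invented example sentences; a real corpus query runs over millions.
sentences = [
    "she made a strong argument in court",
    "the lawyer presented a strong case",
    "he brewed a strong cup of coffee",
]

def collocates(node, window=2):
    """Count words appearing within `window` positions of `node`."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        for i, tok in enumerate(tokens):
            if tok == node:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != node)
    return counts

print(collocates("strong").most_common(3))
```

Even this toy run shows “strong” keeping varied company (“argument”, “case”, “cup”); on a full-size corpus, collocate counts like these reveal which combinations are idiomatic and which are odd.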

That’s where WriteBetter comes in handy. It uses the same technology and follows the same principles we have been discussing, but it has been developed with writers’ needs in mind. Thus, it not only feeds on a vast repository of relevant language samples, carefully selected to be of use to writers looking to hone their skills, but it also, and most importantly, integrates seamlessly with the most commonly used writing apps, displaying relevant language examples in a nice, friendly way.

Linguistic search