daniel

Reputation: 71

Relationship between vocab size and complexity

I have two corpora. If one has a larger vocabulary than the other, does that mean its language is more complex?

Apart from the complexity of the language, what else can affect the size of the vocabulary in a corpus?

Upvotes: 1

Views: 801

Answers (2)

Anscandance

Reputation: 123

Apart from what Oliver has mentioned, in my professional experience the size of the vocabulary in a corpus often depends on the following:

  1. How exactly do you tokenize and count vocabulary in your corpora? For example, if you count a compound as several separate tokens, you will get slightly different numbers than if you count each compound noun as one token.
  2. (elaborating on the issue of "topic" mentioned by Oliver above): each particular topic has its own set of terminology (knitting vs aerospace engineering), but the total term density will depend on the author's vocabulary.
  3. Inclusion of loanwords.
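To illustrate point 1, here is a minimal sketch of how the tokenization rule alone changes the count. The function name `vocab_size`, the `split_compounds` flag and the sample sentence are made up for this example, and it only treats hyphenated words as compounds, which is a simplification.

```python
import re

def vocab_size(text, split_compounds):
    """Count distinct lowercase word types in `text`.

    Hypothetical helper: if split_compounds is True, a hyphenated
    compound such as "well-known" is counted as separate tokens
    ("well", "known"); otherwise it counts as a single token.
    """
    if split_compounds:
        pattern = r"[a-z]+"
    else:
        pattern = r"[a-z]+(?:-[a-z]+)*"
    return len(set(re.findall(pattern, text.lower())))

sample = "A well-known author wrote a well-received book."

# Same text, two tokenization rules, two different vocabulary sizes.
print(vocab_size(sample, split_compounds=True))
print(vocab_size(sample, split_compounds=False))
```

So when comparing two corpora, make sure both were tokenized with the same rules before reading anything into the difference.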

As to your first question about language complexity: every language's complexity is relative to the issue at hand. If we are developing an English-Japanese translator, the Japanese language is VERY complex; if a Chinese person is learning Japanese, it is MODERATELY complex. If we are comparing inflectional morphology, Russian and German are more complex than English. Basically, there are many ways of looking at language complexity, depending on the participants' perspectives.

Upvotes: 1

Oliver Mason

Reputation: 2270

No. Language consists of a lot more than just vocabulary. If the grammatical structures are convoluted, then even a smaller vocabulary can lead to very complex sentences.

In order to answer the second part properly, you'd first need to define what exactly you mean by 'complexity'. Unlike, e.g., sentence length, it is not a measure that can easily be quantified.

Most reading comprehension measures combine the length of words and sentences, on the assumption that longer words and longer sentences are harder to understand; however, shorter words tend to have more distinct meanings, and are arguably harder to understand if their meaning is not clear from the context.
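As one concrete instance of such a measure, the sketch below computes the Automated Readability Index, which combines average word length in characters with average sentence length in words. The splitting rules here are deliberately naive assumptions (sentences end at `.`, `!` or `?`, and only alphabetic runs count as words), so treat the scores as rough.

```python
import re

def automated_readability_index(text):
    """Rough Automated Readability Index for `text`.

    ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    Sentence and word splitting are simplistic assumptions made for
    this example, not a proper tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

short_text = "The cat sat. The dog ran."
long_text = "Extraordinary circumstances necessitate comprehensive deliberation."

print(automated_readability_index(short_text))
print(automated_readability_index(long_text))
```

Note that vocabulary size does not enter the formula at all, and neither does word sense, which is why short but ambiguous words are scored as easy.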

Update after clarification: The size of the vocabulary depends on various factors, such as:

  1. active vocabulary of the author: if I write a text in my native language (where my vocab is large), the number of different words I use in it will be bigger. If I write in a foreign language where I don't know that many words, it will of course be smaller.
  2. the language itself: a bit of an anomaly, but English has a much larger vocabulary than some other languages, due to its history. There are many near-synonyms, so it's easier to use a greater variety of words. Other languages are more limited.
  3. topic: this is probably the biggest factor, as a very limited, technical topic will result in a more limited vocab. Wikipedia in general uses a broad range of words, but if you only take the articles on animals, the vocab will be more restricted.
  4. style: similar to (1), I have an influence on the vocab size by how I write. By limiting my vocab, I can make a text more 'plain' (and leave more to the reader's imagination).

Upvotes: 1
