How to compare complexities of corpora?

Question

I would like to compare how complex (varied or predictable) are my three corpora. They are from different topics, so some vocabulary is different, some is the same. Looking at one of the data sets it's clear that the syntax is more difficult than in the other two, sentences are longer, etc. I built word N-Gram language models using the SRILM toolkit (I'm new to language modelling) with the idea that I can then compare these models. One measure mentioned in relation to language models is perplexity. I'm confused about the following question: Can I just use perplexities of the three LMs directly as a measure of how varied are the corpora? The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison. I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on. What measures could be used to compare complexity of corpora from different domains? I'd appreciate your advise. [I'm new to Stackexchange. I posted this on Crossvalidated, but I think maybe here is more appropriate forum.]

How to compare complexities of corpora?

Answers (1)

Related Questions