Reputation: 31
I would like to compare how complex (varied or predictable) are my three corpora. They are from different topics, so some vocabulary is different, some is the same. Looking at one of the data sets it's clear that the syntax is more difficult than in the other two, sentences are longer, etc. I built word N-Gram language models using the SRILM toolkit (I'm new to language modelling) with the idea that I can then compare these models. One measure mentioned in relation to language models is perplexity. I'm confused about the following question: Can I just use perplexities of the three LMs directly as a measure of how varied are the corpora? The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison. I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on. What measures could be used to compare complexity of corpora from different domains? I'd appreciate your advise. [I'm new to Stackexchange. I posted this on Crossvalidated, but I think maybe here is more appropriate forum.]
Upvotes: 3
Views: 154
Reputation: 619
"I also built LMs from POS-Tags but the quality of the POS-Tagging result is not good because the language is from fora, has spelling mistakes, ungrammatical sentences and so on."
Aside from it being noisy, like you pointed out, you should think carefully about whether particular linguistic features are useful in your analysis. Does one corpus having proportionally more nouns move you toward what you want to learn about the corpora? Maybe in something like authorship attribution, but I can't really think of anywhere else that's effective.
If data sparsity is an issue, LSI can help by collapsing related terms together. This could also help with the spelling issues, collapsing poorly spelt words with their correct counterparts if they appear in similar contexts.
"The vocabulary and the sizes of the corpora are different, so now I think that this won't be a good comparison."
It's not the end of the world. Having more data is always better, but you can work with what you have.
If you haven't chosen a language model yet, there's a few decisions you have to make:
You mention that you have a language model; I'm assuming your language model is a probability distribution such that P(N-gram|topic)
. If this is correct, you've already normalized the data, so the two probability distributions should be readily comparable. Having more data would get you a more reliable result, but if your corpora are "big enough" to sample each topic reliably, you can move right into comparison.
As for comparison, try the KL-Divergence. KL-Divergence is "a measure of the information lost when Q is used to approximate P." Less loss means that the corpora are more similar. If you want a symmetric comparison, a "cheap" way to do it is to add D(P||Q) + D(Q||P)
. Note, though:
The KL divergence is only defined if Q(i)=0 ⇒ P(i)=0, for all i (absolute continuity).
So you'll have to smooth, in some manner.
Upvotes: 2