Reputation: 729
Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough" since I'm just using the most common n N-grams from the documents.
On the other hand well-functioning text categorization eludes me. I've tried with both my own implementations of various variations of the algorithms at hand, with and without various tweaks such as idf weighting and other peoples' implementations. It works quite well as long as I can generate somewhat similarly-sized frequency profiles for the category reference documents but the moment they start to differ just a bit too much the whole thing falls apart and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is. What is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram that equals the length of the category frequency profile but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in this fix is actually because I'm trying to automate the generation of category profiles based on documents with a known category which can vary in length (and even if they are the same length the profiles may end up being different lengths). Is there a "best practice" solution to this?
Upvotes: 1
Views: 889
Reputation: 5525
As I know the task is to count probability of generation some text by language model M.
Recently i was working on measuring the readaiblity of texts using semantic, synctatic and lexical properties. It can be also measured by language model approach.
To answer properly you should consider these questions:
Are you using log-likelihood approach?
What levels of N-Grams are you using? unigrams digrams or higher level?
How big are language corpuses that you use?
Using only digrams and unigrams i managed to classify some documents with nice results. If your classification is weak consider creating bigger language corpuse or using n-grams of lower levels.
Also remember that classifying some text to invalid category may be an error depending on length of text (randomly there are few words appearing in another language models).
Just consider making your language corpuses bigger and know that analysing short texts have higher probability of missclasification
Upvotes: 1
Reputation: 709
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensional reduction, different index weighting metrics and obviously different classification algorithms.
See here for an example of n-gram use in text classification
Upvotes: 1