avizzzy

Reputation: 535

Evaluating language identification methods

Part of my thesis work is to evaluate a number of language detection methods that are already available and then finally implement one of them. For this I have chosen the following methods (a rough sketch of the first one follows the list):

  1. N-Gram-Based Text Categorization by Cavnar and Trenkle
  2. Statistical Identification of Language by Ted Dunning
  3. Using compression-based language models for text categorization by Teahan and Harper
  4. Character Set Detection
  5. A composite approach to language/encoding detection
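
To make it concrete, here is a rough, simplified sketch of the first method (Cavnar and Trenkle's rank-ordered character n-gram profiles compared with the out-of-place distance). The function names are my own, and this leaves out details from the original paper such as underscore padding of tokens:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Rank-ordered character n-gram profile (simplified Cavnar & Trenkle)."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the top_k most frequent n-grams; the stored value is the rank.
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language profile get max_penalty."""
    return sum(abs(rank - lang_profile.get(gram, max_penalty))
               for gram, rank in doc_profile.items())

def identify(text, lang_profiles):
    """Pick the language whose training profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Here `lang_profiles` would be built once from training text per language, e.g. `{"en": ngram_profile(english_training_text), "fi": ngram_profile(finnish_training_text)}`.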

I have to first evaluate the methods and preferably present a table with the accuracy of each one. My question is: in order to find the accuracy of each of these methods, do I need to go ahead and build the language models using training data, then test them and record the accuracy, or is there another approach I can follow here? Most of the original papers already include such accuracy tables, but I am not sure whether it is acceptable at my institution to simply take those numbers and present them in my report.
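
If it clarifies what I mean by "build and test", this is roughly the evaluation loop I had in mind; the `build_models` and `classify` callables are placeholders for whichever method is being measured, and a labelled corpus of (language, text) pairs is assumed:

```python
import random

def evaluate_accuracy(labelled_texts, build_models, classify, train_fraction=0.8, seed=0):
    """labelled_texts: list of (language, text) pairs.
    build_models: {lang: [training texts]} -> models (whatever the method needs).
    classify:     (text, models) -> predicted language code."""
    rng = random.Random(seed)
    data = list(labelled_texts)
    rng.shuffle(data)
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]

    # Group the training texts by language and build one model per language.
    by_lang = {}
    for lang, text in train:
        by_lang.setdefault(lang, []).append(text)
    models = build_models(by_lang)

    # Accuracy = fraction of held-out documents whose language is predicted correctly.
    correct = sum(1 for lang, text in test if classify(text, models) == lang)
    return correct / len(test)
```

The same loop (and the same train/test split) could then be run once per method to fill in the comparison table.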

Appreciate any thoughts on this.

Upvotes: 0

Views: 105

Answers (1)

Tommi J.

Reputation: 11

I would also suggest asking your thesis advisor. Implementing all of them would be a lot of work, and it is very difficult to compare them properly without being able to test them yourself. If I remember correctly, the last three have not been evaluated thoroughly in the literature, so it would be hard to compare their published results. I have implemented (and evaluated) only the first of them myself. Another big question is how large a part of your thesis this LI evaluation and implementation is meant to be.

Upvotes: 1
