avizzzy

Reputation: 535

Evaluating language identification methods

Part of my thesis work is to evaluate a number of language detection methods that are already available and then finally implement one of them. For this I have chosen the following methods (a rough sketch of the first one follows the list):

  1. N-Gram-Based Text Categorization by Cavnar and Trenkle
  2. Statistical Identification of Language by Ted Dunning
  3. Using compression-based language models for text categorization by Teahan and Harper
  4. Character Set Detection
  5. A composite approach to language/encoding detection
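
To make it concrete, here is a rough, simplified sketch of the first method (Cavnar and Trenkle's rank-ordered character n-gram profiles compared with the out-of-place distance). The function names are my own, and this leaves out details from the original paper such as underscore padding of tokens:

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Rank-ordered character n-gram profile (simplified Cavnar & Trenkle)."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Keep only the top_k most frequent n-grams; the stored value is the rank.
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language profile get max_penalty."""
    return sum(abs(rank - lang_profile.get(gram, max_penalty))
               for gram, rank in doc_profile.items())

def identify(text, lang_profiles):
    """Pick the language whose training profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Here `lang_profiles` would be built once from training text per language, e.g. `{"en": ngram_profile(english_training_text), "fi": ngram_profile(finnish_training_text)}`.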

I have to first evaluate the methods and preferably present a table with the accuracy of each one. My question is: in order to find the accuracy of each of these methods, do I need to go ahead and build the language models using training data, then test them and record the accuracy, or is there another approach I can follow here? Most of the original papers already include such accuracy tables, but I am not sure whether it is acceptable at my institution to simply take those numbers and present them in my report.
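
If it clarifies what I mean by "build and test", this is roughly the evaluation loop I had in mind; the `build_models` and `classify` callables are placeholders for whichever method is being measured, and a labelled corpus of (language, text) pairs is assumed:

```python
import random

def evaluate_accuracy(labelled_texts, build_models, classify, train_fraction=0.8, seed=0):
    """labelled_texts: list of (language, text) pairs.
    build_models: {lang: [training texts]} -> models (whatever the method needs).
    classify:     (text, models) -> predicted language code."""
    rng = random.Random(seed)
    data = list(labelled_texts)
    rng.shuffle(data)
    split = int(len(data) * train_fraction)
    train, test = data[:split], data[split:]

    # Group the training texts by language and build one model per language.
    by_lang = {}
    for lang, text in train:
        by_lang.setdefault(lang, []).append(text)
    models = build_models(by_lang)

    # Accuracy = fraction of held-out documents whose language is predicted correctly.
    correct = sum(1 for lang, text in test if classify(text, models) == lang)
    return correct / len(test)
```

The same loop (and the same train/test split) could then be run once per method to fill in the comparison table.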

Appreciate any thoughts on this.

Upvotes: 0

Views: 105

Answers (1)

Tommi J.

Reputation: 11

I would also suggest asking your thesis advisor. Implementing all of them would be a lot of work, and it is very difficult to compare them properly without being able to test them yourself. If I remember correctly, the last three have not been evaluated thoroughly in the literature, so it would be hard to compare their published results. I have implemented (and evaluated) only the first of them myself. Another big question is how large a part of your thesis this LI evaluation and implementation is meant to be.

Upvotes: 1
