Ini Koq Apah

Reputation: 75

Naive Bayes text classification calculation: better to do in MySQL or Java?

The class-conditional probability in Naive Bayes, in log form, is calculated as

log2 P(t|c) = log2((n1 + 1) / (n2 + n3))

Where

  1. t = a token; c = a class
  2. n1 = number of occurrences of token t in class c
  3. n2 = total number of tokens in class c
  4. n3 = total number of tokens in all classes
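
For reference, the per-token calculation itself is tiny on the Java side. The following is only a minimal sketch of the formula as written above; the class name `NaiveBayesLog` and method name `logCondProb` are made up for illustration, and the counts n1, n2, n3 are assumed to have been fetched already.

```java
// Hypothetical helper, not part of any library.
class NaiveBayesLog {

    /**
     * Returns log2((n1 + 1) / (n2 + n3)), following the formula in the question.
     * n1 = occurrences of the token in the class,
     * n2 = total tokens in the class,
     * n3 = total tokens in all classes.
     */
    static double logCondProb(long n1, long n2, long n3) {
        if (n1 < 0 || n2 < 0 || n3 < 0 || n2 + n3 == 0) {
            throw new IllegalArgumentException("counts must be non-negative and n2 + n3 > 0");
        }
        // Java has no built-in log2, so divide the natural log by ln(2).
        return Math.log((n1 + 1.0) / (n2 + n3)) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // Example: n1 = 3, n2 = 10, n3 = 6 gives log2(4 / 16).
        System.out.println(logCondProb(3, 10, 6));
    }
}
```

With n1 = 3, n2 = 10 and n3 = 6 the ratio is 4/16, so the result is about -2. The arithmetic is cheap either way; the cost is mostly in producing the counts.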

Which is faster: doing the calculation in MySQL or in Java? (For Java, the data would of course have to be fetched from MySQL first.)

Upvotes: 0

Views: 307

Answers (1)

Gordon Linoff

Reputation: 1271151

The Naive Bayes classifier is computationally simple, but it requires a lot of data manipulation. When applied to text, you are generally looking for many different terms inside the text.

I have a natural bias toward doing these types of calculations in SQL. I would at least argue that MySQL is a reasonable environment for doing this. Depending on the exact nature of the problem and the structure of your data, you might find that full text indexing is helpful. I would be wary of working with a large corpus (many tens or hundreds of gigabytes) on the application side. My book "Data Analysis Using SQL and Excel" has a chapter devoted to Naive Bayes and similar types of models.
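
As a rough illustration of keeping the calculation in the database, here is a sketch that lets MySQL compute the log values and has Java read only the results over JDBC. Everything about the schema is an assumption: a hypothetical table `token_counts(token, class_label, cnt)` with one row per token and class. The query uses window functions, so it assumes MySQL 8.0 or later.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class and schema, for illustration only.
class NaiveBayesSql {

    // cnt plays the role of n1, the per-class sum is n2, the overall sum is n3.
    static final String LOG_PROB_SQL =
          "SELECT token, class_label, "
        + "LOG2((cnt + 1) / (SUM(cnt) OVER (PARTITION BY class_label) "
        + "+ SUM(cnt) OVER ())) AS log_p "
        + "FROM token_counts";

    /** Runs the query and returns log2 values keyed by "token|class_label". */
    static Map<String, Double> loadLogProbs(Connection conn) throws SQLException {
        Map<String, Double> result = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(LOG_PROB_SQL);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                String key = rs.getString("token") + "|" + rs.getString("class_label");
                result.put(key, rs.getDouble("log_p"));
            }
        }
        return result;
    }
}
```

The point of the design is that only one row per token and class crosses the connection, rather than the raw corpus. Whether this is actually faster depends on the data volume and indexing, so it is worth measuring both approaches on your own data.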

Upvotes: 1
