sourabh

Reputation: 223

Spark Naive Bayes model persistence: understanding pi & theta

I am working on a Naive Bayes based implementation using Spark 2.0. I am done with model tuning, but I am stuck on persisting the model. I am well aware of the model persistence support in Spark 2, but my concern is the content of the saved Naive Bayes model, particularly in the data folder of the saved model. It stores the value of pi (a vector), which depends on the number of classes, and theta (a matrix), which depends on the number of classes and the number of features set for Naive Bayes. So, in short, the content of the model's data folder depends on the actual data and will grow with data size.

Can anyone help me understand what exactly it stores? I need this to decide where to put this data in my production architecture.

I have searched a lot for these but still don't understand exactly what they are. The Spark Java docs mention them, but I am not able to understand what exactly these values are and why they are needed. It would be helpful if anyone could explain.

The question also relates to the fact that they were added in version 2.0, so before that, in 1.6, it would have been working without pi & theta.

Upvotes: 3

Views: 640

Answers (1)

Uwe Bretschneider

Reputation: 31

These two attributes comprise the Naive Bayes model. Naive Bayes is meant to predict a class C given a feature vector X (your input vector). To do this, it relies on Bayes' theorem. With some mathematical magic you can simplify Bayes' theorem for classification; what's left is:

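P(C|X) ∝ P(C) * P(x1|C) * ... * P(xn|C)

or, taking logarithms so that the product becomes a sum:

log P(C|X) = log P(C) + log P(x1|C) + ... + log P(xn|C) + const

On a side note: the first relation is a proportionality rather than an equality, because the normalizing term P(X) is dropped (it is the same for every class, and it is the constant in the log form). It also relies on the "naive" assumption that the features are independent given the class.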
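To make the log form concrete, here is a minimal sketch of how a prediction could be scored from a stored pi and theta. It assumes the multinomial variant (the default model type in Spark's NaiveBayes), where each feature value is a count and contributes count times its log conditional probability. The function name `predict` and all numbers are made up for illustration; this is not Spark's own code.

```python
import math

def predict(pi, theta, x):
    """Score each class in log space and return (best_class_index, scores).

    pi    : list of log class priors, one entry per class
    theta : list of rows, theta[c][j] = log P(feature j | class c)
    x     : feature vector of non-negative counts
    """
    scores = [
        pi_c + sum(t * x_j for t, x_j in zip(theta_c, x))
        for pi_c, theta_c in zip(pi, theta)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical model with 2 classes and 3 features.
pi = [math.log(0.6), math.log(0.4)]
theta = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.3), math.log(0.6)],
]

best, scores = predict(pi, theta, [2, 0, 1])
print(best, scores)
```

Only the ranking of the scores matters for classification, which is why the dropped normalizing term does no harm.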

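So the model needs to know these probabilities. pi is the vector of log class priors, log P(C), with one entry per class. theta is the matrix of log conditional probabilities, log P(xj|C), with one row per class and one column per feature. The theta matrix won't grow to infinity: its size is fixed by the number of classes and the number of features, not by the number of training rows.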
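As a rough illustration of why the stored model does not grow with the amount of training data, the sketch below estimates pi and theta for a multinomial Naive Bayes with additive (Laplace) smoothing. The function name `fit`, the `smoothing` parameter, and the toy data are assumptions for this example, and the code is not taken from Spark.

```python
import math

def fit(rows, labels, num_classes, smoothing=1.0):
    """Estimate log priors (pi) and log conditionals (theta) from count data."""
    num_features = len(rows[0])
    class_counts = [0] * num_classes
    feature_sums = [[0.0] * num_features for _ in range(num_classes)]

    # Accumulate per-class row counts and per-class feature totals.
    for x, c in zip(rows, labels):
        class_counts[c] += 1
        for j, v in enumerate(x):
            feature_sums[c][j] += v

    n = len(rows)
    pi = [
        math.log((class_counts[c] + smoothing) / (n + smoothing * num_classes))
        for c in range(num_classes)
    ]
    theta = []
    for c in range(num_classes):
        total = sum(feature_sums[c]) + smoothing * num_features
        theta.append([math.log((s + smoothing) / total) for s in feature_sums[c]])
    return pi, theta

# Toy data: 2 classes, 3 features (hypothetical counts).
small_rows = [[1, 0, 2], [0, 1, 1]]
small_labels = [0, 1]
pi_small, theta_small = fit(small_rows, small_labels, num_classes=2)

# 50 times more rows, same classes and features.
pi_big, theta_big = fit(small_rows * 50, small_labels * 50, num_classes=2)

print(len(pi_small), len(theta_small), len(theta_small[0]))
print(len(pi_big), len(theta_big), len(theta_big[0]))
```

Both models have a pi of length 2 and a 2 x 3 theta, so the saved data scales with (classes x features), not with the number of rows used for training.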

Upvotes: 1
