Reputation: 5626
As per the official documentation of Spark NaiveBayes:
It supports Multinomial NB (see here) which can handle finitely supported discrete data.
How can I handle continuous data (for example, the percentage of some in some document) in Spark NaiveBayes?
Upvotes: 1
Views: 610
Reputation: 330203
The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a better fit when you want to use some domain-specific knowledge.
For encoding you can use dummy encoding using OneHotEncoder with an adjusted dropLast Param.
So overall you'll need:
- QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
- StringIndexer* -> OneHotEncoder for each discrete feature.
- VectorAssembler to combine all of the above.

* Or predefined column metadata.
Upvotes: 1