How to apply the same function to all columns of a dataset in parallel using Spark (Java)

I have a dataset with some categorical features, and I am trying to apply exactly the same function to all of these categorical features in the Spark framework. My first assumption was that I could parallelize the operation on each feature with the operations on the other features. However, I couldn't figure out whether this is possible (I was confused after reading this, this).

For example, assume that my dataset is as following:

feature1, feature2, feature3
blue,apple,snake
orange,orange,monkey
blue,orange,horse

I want to count the number of occurrences of each category for each feature, separately. For example, for feature1: (blue=2, orange=1).

Upvotes: 4

Views: 579

Answers (1)

Jacek Laskowski

Reputation: 74779

TL;DR Spark SQL's DataFrames are partitioned by rows, not by columns, so Spark processes a group of rows per task (never a column), unless you split the source dataset into per-column datasets using a select-like operator.
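For example, to get per-feature parallelism you can split the dataset with select and submit one Spark action per column from separate threads; Spark's scheduler then runs the resulting jobs concurrently. A minimal sketch (the features.csv path, the local master and the class name are my assumptions, not from the question):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PerFeatureJobs {
      public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
            .appName("per-feature-jobs")
            .master("local[*]")   // assumption: local demo
            .getOrCreate();

        // Assumption: a CSV file with the header feature1,feature2,feature3
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .csv("features.csv");

        // select splits the row-partitioned dataset into one single-column
        // dataset per feature; each count() action runs as its own Spark job
        List<Thread> jobs = new ArrayList<>();
        for (String col : df.columns()) {
          Thread t = new Thread(() -> df.select(col).groupBy(col).count().show());
          t.start();
          jobs.add(t);
        }
        for (Thread t : jobs) {
          t.join();
        }
        spark.stop();
      }
    }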

If you want to:

count the number of occurrences of each category for each feature, separately

simply use groupBy and count (perhaps followed by a join), or use window aggregate functions.
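For the sample data, both variants could look like this (a sketch; only the column names and values come from the question, the rest is assumed):

    import static org.apache.spark.sql.functions.count;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.expressions.Window;

    public class CategoryCounts {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("category-counts")
            .master("local[*]")   // assumption: local demo
            .getOrCreate();

        // The sample rows from the question as an inline table
        Dataset<Row> df = spark.sql(
            "SELECT * FROM VALUES"
                + " ('blue', 'apple', 'snake'),"
                + " ('orange', 'orange', 'monkey'),"
                + " ('blue', 'orange', 'horse')"
                + " AS t(feature1, feature2, feature3)");

        // groupBy/count: one row per category, e.g. (blue, 2), (orange, 1)
        df.groupBy("feature1").count().show();

        // Window aggregate: the same count attached to every input row
        df.withColumn("feature1_count",
            count("feature1").over(Window.partitionBy("feature1"))).show();

        spark.stop();
      }
    }

The window variant keeps all three features on every row, so you can add one count column per feature without joining the per-feature aggregates back together.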

Upvotes: 1
