How to apply the same function to all columns of a dataset in parallel using Spark (Java)

I have a dataset with some categorical features, and I am trying to apply exactly the same function to all of these categorical features in the Spark framework. My first assumption was that I could parallelize the operation on each feature with the operations on the other features. However, I couldn't figure out whether this is possible (I was confused after reading this, this).

For example, assume that my dataset is as following:

feature1, feature2, feature3
blue,apple,snake
orange,orange,monkey
blue,orange,horse

I want to count the number of occurrences of each category for each feature, separately. For example, for feature1: (blue=2, orange=1).

Upvotes: 4

Views: 579

Answers (1)

Jacek Laskowski

Reputation: 74779

TL;DR Spark SQL's DataFrames are partitioned by rows, not by columns, so Spark processes a group of rows per task (never a column), unless you split the source dataset into per-column datasets using a select-like operator.
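For example, to get per-feature parallelism you can split the dataset with select and submit one Spark action per column from separate threads; Spark's scheduler then runs the resulting jobs concurrently. A minimal sketch (the features.csv path, the local master and the class name are my assumptions, not from the question):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PerFeatureJobs {
      public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
            .appName("per-feature-jobs")
            .master("local[*]")   // assumption: local demo
            .getOrCreate();

        // Assumption: a CSV file with the header feature1,feature2,feature3
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .csv("features.csv");

        // select splits the row-partitioned dataset into one single-column
        // dataset per feature; each count() action runs as its own Spark job
        List<Thread> jobs = new ArrayList<>();
        for (String col : df.columns()) {
          Thread t = new Thread(() -> df.select(col).groupBy(col).count().show());
          t.start();
          jobs.add(t);
        }
        for (Thread t : jobs) {
          t.join();
        }
        spark.stop();
      }
    }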

If you want to:

count the number of occurrences of each category for each feature, separately

simply use groupBy and count (perhaps followed by a join), or use window aggregate functions.
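For the sample data, both variants could look like this (a sketch; only the column names and values come from the question, the rest is assumed):

    import static org.apache.spark.sql.functions.count;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.expressions.Window;

    public class CategoryCounts {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("category-counts")
            .master("local[*]")   // assumption: local demo
            .getOrCreate();

        // The sample rows from the question as an inline table
        Dataset<Row> df = spark.sql(
            "SELECT * FROM VALUES"
                + " ('blue', 'apple', 'snake'),"
                + " ('orange', 'orange', 'monkey'),"
                + " ('blue', 'orange', 'horse')"
                + " AS t(feature1, feature2, feature3)");

        // groupBy/count: one row per category, e.g. (blue, 2), (orange, 1)
        df.groupBy("feature1").count().show();

        // Window aggregate: the same count attached to every input row
        df.withColumn("feature1_count",
            count("feature1").over(Window.partitionBy("feature1"))).show();

        spark.stop();
      }
    }

The window variant keeps all three features on every row, so you can add one count column per feature without joining the per-feature aggregates back together.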

Upvotes: 1
