Reputation: 81
I have a dataset with some categorical features, and I am trying to apply the exact same function to each of these categorical features in the Spark framework. My first assumption was that I could parallelize the operation on each feature with the operations on the other features. However, I couldn't figure out whether that is possible (I was confused after reading this and this).
For example, assume that my dataset is as following:
feature1, feature2, feature3
blue,apple,snake
orange,orange,monkey
blue,orange,horse
I want to count the number of occurrences of each category for each feature, separately. For example, for feature1: (blue=2, orange=1).
Upvotes: 4
Views: 579
Reputation: 74779
TL;DR Spark SQL's DataFrames are partitioned by rows, not by columns, so Spark processes a group of rows per task (not columns) unless you split the source dataset yourself using a select-like operator.
If you want to:
count the number of occurrences of each category for each feature, separately
simply use groupBy and count (perhaps with join), or use windows (with window aggregate functions).
Upvotes: 1