Sash9

Reputation: 65

Spark Dataframe: Group by custom Range

I have a Spark DataFrame which I aggregated based on a column called "Rank", with "Sum" being the sum of all "Col1" values with that Rank.

df.groupBy("Rank").agg(sum("Col1")).orderBy("Rank").show()

Rank   Sum(Col1)
1      1523
2      785
3      232
4      69
5      126
...    ...
430    7

Instead of having the sum for every single value of "Rank", I would like to group my data into rank "buckets" to get a more compact output. For example:

Rank Range   Sum(Col1)
1            1523
2-5          1212
6-10         ...
...          ...
100+         ....

Instead of having four different rows for Ranks 2, 3, 4 and 5, I would like to have one row "2-5" showing the sum for all of these ranks. What would be the best way of doing that? I am quite new to Spark DataFrames and am thankful for any help, especially examples of how to achieve this.

Thank you !

Upvotes: 1

Views: 888

Answers (1)

Yosi Dahari

Reputation: 7009

A few options:

  1. Histogram - build a histogram over the Rank values. See the following post: Making histogram with Spark DataFrame column

  2. Add another column for the bucket values (See Apache spark dealing with case statements):

import org.apache.spark.sql.functions.when

df.select(when(df("Rank") === 1, "1")
     .when(df("Rank").between(2, 5), "2-5")
     .otherwise("100+")
     .as("Rank Range"))

Now you can run your group by query on the new Rank Range column.
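For completeness, here is a minimal end-to-end sketch of option 2, assuming df is the DataFrame from the question, "Rank" holds integers, and the bucket boundaries below (1, 2-5, 6-10, 11-100, 100+) are just illustrative choices you would adjust to your own ranges:

import org.apache.spark.sql.functions.{col, sum, when}

// Derive a "Rank Range" bucket from the numeric Rank column
// (boundaries are illustrative, not prescribed by the question).
val bucketed = df.withColumn("Rank Range",
  when(col("Rank") === 1, "1")
    .when(col("Rank").between(2, 5), "2-5")
    .when(col("Rank").between(6, 10), "6-10")
    .when(col("Rank").between(11, 100), "11-100")
    .otherwise("100+"))

// Aggregate Col1 per bucket instead of per individual rank
bucketed.groupBy("Rank Range")
  .agg(sum("Col1").as("Sum(Col1)"))
  .show()

Note that the bucket labels are plain strings, so ordering by "Rank Range" would be lexicographic; if you need the buckets in numeric order in the output, keep an additional numeric bucket index column and order by that instead.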

Upvotes: 1
