Reputation: 65
I have a Spark DataFrame which I aggregated on a column called "Rank", with "Sum" being the sum of all values sharing that rank.
df.groupBy("Rank").agg(sum("Col1")).orderBy("Rank").show()
| Rank | Sum(Col1) |
|------|-----------|
| 1    | 1523      |
| 2    | 785       |
| 3    | 232       |
| 4    | 69        |
| 5    | 126       |
| ...  | ...       |
| 430  | 7         |
Instead of having the sum for every single value of "Rank", I would like to group my data into rank "buckets" to get a more compact output. For example:
| Rank Range | Sum(Col1) |
|------------|-----------|
| 1          | 1523      |
| 2-5        | 1212      |
| 6-10       | ...       |
| ...        | ...       |
| 100+       | ...       |
Instead of having four different rows for ranks 2, 3, 4 and 5, I would like one row "2-5" showing the sum over all of these ranks. What would be the best way of doing that? I am quite new to Spark DataFrames and would be thankful for any help, especially examples of how to achieve this.
Thank you!
Upvotes: 1
Views: 888
Reputation: 7009
A few options:
Histogram - build a histogram. See the following post: Making histogram with Spark DataFrame column
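For that first option, a minimal sketch (assuming "Rank" is an integer column; the bucket boundaries below are made up for illustration, and note that a histogram counts rows per bucket rather than summing "Col1"):

import org.apache.spark.sql.Row

// Buckets are [1,2), [2,6), [6,11), [11,101), [101,1000]; the last one includes its upper bound
val buckets = Array(1.0, 2.0, 6.0, 11.0, 101.0, 1000.0)
val counts: Array[Long] = df.select("Rank").rdd
  .map { case Row(rank: Int) => rank.toDouble }
  .histogram(buckets)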
Add another column for the bucket values (see Apache spark dealing with case statements):
import org.apache.spark.sql.functions.{when, col}

val bucketed = df.withColumn("Rank Range",
  when(col("Rank") === 1, "1")
    .when(col("Rank").between(2, 5), "2-5") // add further buckets the same way
    .otherwise("100+"))
Now you can run your group-by query on the new "Rank Range" column.
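For example, with the illustrative `bucketed` DataFrame from the snippet above, the query mirrors the original one:

import org.apache.spark.sql.functions.sum

bucketed.groupBy("Rank Range").agg(sum("Col1")).show()
// Caveat: ordering by "Rank Range" sorts lexicographically ("1", "100+", "2-5", ...),
// so keep a numeric bucket key as well if sorted output matters.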
Upvotes: 1