Raji
Raji

Reputation: 1

Group data based on multiple column in spark using scala's API

I have an RDD, want to group data based on multiple column. for large dataset spark cannot work using combineByKey, groupByKey, reduceByKey and aggregateByKey, these gives heap space error. Can you give another method for resolving it using Scala's API?

Upvotes: 0

Views: 732

Answers (1)

C4stor
C4stor

Reputation: 8036

You may want to use treeReduce() for doing incremental reduce in Spark. However, you hypothesis that spark can not work on large dataset is not true, and I suspect you just don't have enough partitions in your data, so maybe a repartition() is what you need.

Upvotes: 1

Related Questions