Reputation: 1

Group data based on multiple columns in Spark using the Scala API

I have an RDD and want to group its data by multiple columns. For a large dataset, Spark does not seem to cope: combineByKey, groupByKey, reduceByKey and aggregateByKey all fail with a heap space error. Is there another way to do this with the Scala API?

Upvotes: 0

Answers (1)

C4stor

Reputation: 8036

You may want to use treeReduce() to do an incremental reduce in Spark. However, your hypothesis that Spark cannot work on large datasets is not true. I suspect you just don't have enough partitions in your data, so a repartition() may be what you need.
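As a rough sketch of what that could look like, assuming a made-up record type `Sale` with two grouping columns (`country`, `product`) and one numeric column (`amount`), none of which come from the question: a tuple of the columns serves as a composite key, and the partition count is passed straight to reduceByKey so the shuffle is spread over more, smaller partitions. The value 8 is only a placeholder and would need tuning for real data. Note that treeReduce() collapses the whole RDD into a single value, so it suits a global aggregate rather than a per-key one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record type: two grouping columns and one numeric value.
case class Sale(country: String, product: String, amount: Double)

object GroupByMultipleColumns {
  // Sum `amount` per (country, product); the tuple acts as a composite key.
  // `numPartitions` is an assumed tuning knob: more partitions means less
  // data per task during the shuffle.
  def totalsByCountryAndProduct(
      sales: RDD[Sale],
      numPartitions: Int): RDD[((String, String), Double)] =
    sales
      .map(s => ((s.country, s.product), s.amount))
      .reduceByKey(_ + _, numPartitions)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("group-by-multiple-columns")
      .setMaster("local[*]") // local mode, for trying the sketch out
    val sc = new SparkContext(conf)
    try {
      val sales = sc.parallelize(Seq(
        Sale("FR", "book", 10.0),
        Sale("FR", "book", 5.0),
        Sale("DE", "pen", 2.0)
      ))

      // Per-key aggregate over the composite key.
      val totals = totalsByCountryAndProduct(sales, numPartitions = 8)
      totals.collect().foreach(println)

      // Global aggregate: treeReduce combines partial results in stages
      // instead of sending every partition's result to the driver at once.
      val grandTotal = sales.map(_.amount).treeReduce(_ + _, depth = 2)
      println(grandTotal)
    } finally {
      sc.stop()
    }
  }
}
```

If the records have to stay grouped rather than be reduced to one value per key, the same composite key works with groupByKey, but then all values for a key must fit in memory on one executor, which is a common source of heap space errors.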

Upvotes: 1
