Reputation: 279
Let's say I have some data like:
A B Value
1 1 40
1 2 3
1 2 5
2 1 6
2 2 10
in a DataFrame (say `df`), and I have partitioned it on both A and B with:
df.repartition($"A", $"B")
Now, let's say we need to count the number of values divisible by 2 or by 5 in each partition (separately). It would be unreasonable to maintain as many variables as there are partitions. What is the most efficient way to go about this?
(Kindly offer solutions that are applicable to Spark 1.6+)
Upvotes: 2
Views: 346
Reputation: 3692
You can use the .mapPartitions transformation to run a computation once per partition. For example:
rdd.mapPartitions { part =>
  // part is an Iterator over the elements of a single partition;
  // any local state declared here is scoped to that partition
  var s = 0
  part.foreach(_ => s += 1)  // operation on the elements of this partition
  Iterator(s)                // emit one result per partition
}
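Applied to the question's DataFrame, a sketch might look like the following (assuming the Spark 1.6+ Scala API; the variable name `counts` and the use of mapPartitionsWithIndex to label partitions are illustrative, not required):
counts = df.repartition($"A", $"B")
  .rdd                                   // DataFrame -> RDD[Row]
  .mapPartitionsWithIndex { (idx, rows) =>
    // One pass over this partition: count values divisible by 2 or 5
    val n = rows.count { row =>
      val v = row.getAs[Int]("Value")
      v % 2 == 0 || v % 5 == 0
    }
    Iterator((idx, n))                   // one (partitionIndex, count) pair
  }
  .collect()                             // small result, safe to collect
Each element of `counts` gives the count for one partition, so no per-partition variables need to be maintained on the driver.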
Upvotes: 1