Reputation: 279
Let's say I have some data like:
A B Value
1 1 40
1 2 3
1 2 5
2 1 6
2 2 10
in a DataFrame (say `df`), and I have partitioned it on both A and B with:
df.repartition($"A", $"B")
Now, let's say we need to count the number of values divisible by 2 or by 5 in each partition (separately). It would be unreasonable to maintain as many variables as there are partitions. What is the most efficient way to go about this?
(Kindly offer solutions that are applicable to Spark 1.6+)
Upvotes: 2
Views: 346
Reputation: 3692
You can use the .mapPartitions transformation to run a computation once per partition. For example:
rdd.mapPartitions { part =>
  // part is an Iterator over the elements of a single partition;
  // any local state declared here is scoped to that partition
  var s = 0
  part.foreach(_ => s += 1)  // operation on the elements of this partition
  Iterator(s)                // emit one result per partition
}
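Applied to the question's DataFrame, a sketch might look like the following (assuming the Spark 1.6+ Scala API; the variable name `counts` and the use of mapPartitionsWithIndex to label partitions are illustrative, not required):
counts = df.repartition($"A", $"B")
  .rdd                                   // DataFrame -> RDD[Row]
  .mapPartitionsWithIndex { (idx, rows) =>
    // One pass over this partition: count values divisible by 2 or 5
    val n = rows.count { row =>
      val v = row.getAs[Int]("Value")
      v % 2 == 0 || v % 5 == 0
    }
    Iterator((idx, n))                   // one (partitionIndex, count) pair
  }
  .collect()                             // small result, safe to collect
Each element of `counts` gives the count for one partition, so no per-partition variables need to be maintained on the driver.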
Upvotes: 1