aswa09

Reputation: 279

SPARK: Maintaining different variables for different partitions?

Let's say I have some data like:

A B Value
1 1 40
1 2 3
1 2 5
2 1 6
2 2 10

This data is in a DataFrame (say df), and I have partitioned it on both A and B as:

df.repartition($"A",$"B")
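(For reference, a minimal sketch of this setup, assuming a Spark 1.6 shell where a SQLContext named sqlContext is available; the integer column types are my assumption:)

import sqlContext.implicits._

// build the sample data from the table above as a DataFrame
val df = Seq(
  (1, 1, 40),
  (1, 2, 3),
  (1, 2, 5),
  (2, 1, 6),
  (2, 2, 10)
).toDF("A", "B", "Value")

// partition by both A and B
val partitioned = df.repartition($"A", $"B")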

Now, let's say we need to count the number of values divisible by 2 or by 5 in each partition (separately). It would be unreasonable to maintain as many variables as there are partitions. What is the optimal way to go about this?

(Kindly offer solutions that are applicable in Spark 1.6+)

Upvotes: 2

Views: 346

Answers (1)

Sandeep Purohit

Reputation: 3692

You can use the .mapPartitions transformation to run any calculation separately on each partition, for example:

rdd.mapPartitions { x =>
  // keep any per-partition state in local variables
  var s = 0
  x.foreach { e =>
    // operation on each element of the partition, e.g. update s
  }
  Iterator(s)   // emit one result per partition
}
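For the divisible-by-2-or-5 count in the question, a minimal sketch (assuming the df from the question with an integer Value column) could drop to the underlying RDD and emit one count per partition:

df.repartition($"A", $"B")
  .rdd
  .mapPartitions { rows =>
    var count = 0                                 // local counter, one per partition
    rows.foreach { row =>
      val v = row.getAs[Int]("Value")
      if (v % 2 == 0 || v % 5 == 0) count += 1
    }
    Iterator(count)                               // one result per partition
  }
  .collect()
  .foreach(println)

Note that hash partitioning may place several (A, B) groups in the same partition (and leave some partitions empty), so this counts per physical partition rather than per (A, B) group.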

Upvotes: 1
