diens

Reputation: 659

Spark: What is the proper way to use a broadcast variable?

I don't know if I'm using a broadcast variable correctly.

I have two RDDs, rdd1 and rdd2. I want to apply rdd2.mapPartitionsWithIndex(...), and for each partition I need to perform some calculation using the whole rdd1. So I think this is a case for a broadcast variable. First question: am I thinking about this right?

To do so, I did this:

val rdd1Broadcast = sc.broadcast(rdd1.collect())

Second question: why do I need to call .collect()? I have seen examples with and without .collect(), but I couldn't figure out when it is needed.

Also, I did this:

val rdd3 = rdd2.mapPartitionsWithIndex( myfunction(_, _, rdd1Broadcast), preservesPartitioning = preserves).cache()

Third question: Which is better: passing rdd1Broadcast or rdd1Broadcast.value?

Upvotes: 0

Views: 926

Answers (1)

Alper t. Turker

Reputation: 35249

Am I thinking about this right?

There is really not enough information to answer this part. Broadcasting is useful only if the broadcasted object is relatively small, or if local access significantly reduces computational complexity.

Why do I need to call .collect()?

Because the data in an RDD can be accessed only through the driver. Broadcasting the RDD itself is not meaningful: you cannot use an RDD from inside a task, so you have to collect its contents to the driver first and broadcast the resulting local collection.
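
A minimal sketch of that workflow, assuming sc and rdd1 from the question; the element type (String, Int) is only an illustration:

    // Collect rdd1 into a local collection on the driver, then broadcast that
    // collection so every executor gets a read-only copy of it.
    val localData: Array[(String, Int)] = rdd1.collect()
    val rdd1Broadcast = sc.broadcast(localData)   // Broadcast[Array[(String, Int)]]
    // Broadcasting rdd1 directly would not help: an RDD's data cannot be read
    // from inside a task, only the collected local copy can.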

Which is better: passing rdd1Broadcast or rdd1Broadcast.value?

The argument should be of type Broadcast[_], so don't pass rdd1Broadcast.value. If the value itself is passed, it is evaluated on the driver and captured in the closure, and the broadcast mechanism is not used at all.
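
A sketch of what that could look like with the question's names; the body of myfunction and the element types are assumptions, not taken from the question:

    import org.apache.spark.broadcast.Broadcast

    // The task function receives the Broadcast handle and calls .value inside,
    // i.e. on the executor, where the broadcast copy is available locally.
    def myfunction(index: Int,
                   iter: Iterator[String],
                   rdd1Broadcast: Broadcast[Array[(String, Int)]]): Iterator[String] = {
      val lookup = rdd1Broadcast.value.toMap
      iter.map(x => s"$x -> ${lookup.getOrElse(x, 0)} (partition $index)")
    }

    val rdd3 = rdd2.mapPartitionsWithIndex(
      myfunction(_, _, rdd1Broadcast),        // pass the handle, not rdd1Broadcast.value
      preservesPartitioning = false           // or whatever `preserves` is in your code
    ).cache()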

Upvotes: 2
