mlee_jordan

Reputation: 842

Broadcast variable does not contain all the data

When I broadcast the result of collectAsMap(), the broadcast variable does not contain all of the values. For example:

    val emp = sc.textFile("...text1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
    val emp_new = sc.textFile("...text2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
    emp_new.foreach(println)

    val emp_newBC = sc.broadcast(emp_new.collectAsMap())
    println(emp_newBC.value)

When I checked the values in emp_newBC, I saw that not all of the data from emp_new appears. What am I missing?

Thanks in advance.

Upvotes: 1

Views: 279

Answers (1)

TheMP

Reputation: 8427

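The problem is that emp_new is an RDD of tuples, while emp_newBC is a broadcast map. A map holds only one value per key, so when you call collectAsMap(), tuples that share a key collapse into a single entry and you end up with fewer elements. Note that distinct() only removes identical (key, value) pairs, not repeated keys. If you want to get back an array of all the tuples, broadcast the result of collect() instead: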

    val emp_newBC = sc.broadcast(emp_new.collect())
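To see the effect without a cluster, here is a minimal sketch using plain Scala collections as a stand-in for the RDD. The sample data and the names `CollapseDemo`, `pairs`, `asMap` and `grouped` are made up for illustration; `toMap` plays the role of `collectAsMap()` here, since both keep a single value per key.

```scala
object CollapseDemo {
  // Hypothetical sample data standing in for the (key, value) pairs in emp_new.
  val pairs: Seq[(String, String)] = Seq(
    ("dept1", "alice"),
    ("dept1", "bob"),   // same key as the line above, different value
    ("dept2", "carol")
  )

  // distinct only drops identical (key, value) pairs, so all three survive.
  val distinctPairs: Seq[(String, String)] = pairs.distinct

  // Converting to a Map keeps one value per key, so "dept1" appears only once.
  val asMap: Map[String, String] = distinctPairs.toMap

  // Grouping by key first keeps every value for each key.
  val grouped: Map[String, Seq[String]] =
    distinctPairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    println(distinctPairs.size) // 3
    println(asMap.size)         // 2
    println(grouped("dept1"))
  }
}
```

If you do need a map keyed by the first field but with every value kept, the Spark analogue of the last step would be something like `emp_new.groupByKey().mapValues(_.toList).collectAsMap()`, which you can then pass to `sc.broadcast` as before.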

Upvotes: 1
