Sugimiyanto

Reputation: 329

Append/concatenate two RDDs of type Set in Apache Spark

I am working with Spark RDDs and need to append/concatenate two RDDs of type RDD[Set[Int]] so that the sets themselves are merged.

scala> var ek: RDD[Set[Int]] = sc.parallelize(Seq(Set(7)))
ek: org.apache.spark.rdd.RDD[Set[Int]] = ParallelCollectionRDD[31] at parallelize at <console>:32

scala> val vi: RDD[Set[Int]] = sc.parallelize(Seq(Set(3,5)))
vi: org.apache.spark.rdd.RDD[Set[Int]] = ParallelCollectionRDD[32] at parallelize at <console>:32

scala> val z = vi.union(ek)
z: org.apache.spark.rdd.RDD[Set[Int]] = UnionRDD[34] at union at <console>:36

scala> z.collect
res15: Array[Set[Int]] = Array(Set(3, 5), Set(7))

scala> val t = vi ++ ek
t: org.apache.spark.rdd.RDD[Set[Int]] = UnionRDD[40] at $plus$plus at <console>:36

scala> t.collect
res30: Array[Set[Int]] = Array(Set(3, 5), Set(7))

I have tried two operators, union and ++, but neither produces the expected result. Both return:

Array(Set(3, 5), Set(7))

The expected result is a single merged set, as with plain Scala sets:

scala> val u = Set(3,5)
u: scala.collection.immutable.Set[Int] = Set(3, 5)

scala> val o = Set(7)
o: scala.collection.immutable.Set[Int] = Set(7)

scala> u.union(o)
res28: scala.collection.immutable.Set[Int] = Set(3, 5, 7)

Can anybody point me in the right direction?

Upvotes: 1

Views: 2370

Answers (1)

Mousa

Reputation: 3036

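You are applying union to RDDs built from a list (Seq) of sets, so each RDD element is a whole set rather than one of its members, and union only concatenates those elements. Try parallelizing the members of each set instead, which gives an RDD[Int] rather than an RDD[Set[Int]]: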

var ek = sc.parallelize(Set(7).toSeq)
val vi = sc.parallelize(Set(3,5).toSeq)
val z = vi.union(ek)
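
If the data already exists as an RDD[Set[Int]] and cannot be re-parallelized, the sets can also be merged after the union. Below is a minimal sketch, assuming a local SparkContext. The object SetRddOps and its two helper names are hypothetical, made up for illustration; only fold, flatMap, distinct and union are Spark calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helpers for illustration, not part of the Spark API.
object SetRddOps {
  // Merge every Set in the RDD into one Set on the driver.
  // fold (unlike reduce) also works on an empty RDD.
  def mergeToSet(rdd: RDD[Set[Int]]): Set[Int] =
    rdd.fold(Set.empty[Int])(_ ++ _)

  // Stay distributed: one RDD element per distinct Int.
  def flattenDistinct(rdd: RDD[Set[Int]]): RDD[Int] =
    rdd.flatMap(s => s).distinct()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("set-rdd-ops").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val ek: RDD[Set[Int]] = sc.parallelize(Seq(Set(7)))
      val vi: RDD[Set[Int]] = sc.parallelize(Seq(Set(3, 5)))
      val z = vi.union(ek) // still two elements: Set(3, 5) and Set(7)

      println(mergeToSet(z))                      // a single Set holding 3, 5 and 7
      println(flattenDistinct(z).collect().toSet) // the same members, as Ints
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell, where sc is already defined, the same idea is a one-liner on the question's z: z.fold(Set.empty[Int])(_ ++ _). Note that union alone never removes duplicates, so distinct() is needed if the input sets overlap and the result is kept as an RDD[Int].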

Upvotes: 4
