AkhilaV

Reputation: 483

How to efficiently find distinct values from each column in Spark

To find the distinct values in each column of an RDD[Array[String]] I tried

RDD[Array[String]].map(_.map(Set(_))).reduce {
  (a, b) => a.zip(b).map { case (x, y) => x ++ y }
}

which executes successfully. However, I'd like to know if there is a more efficient way of doing this than my sample code above. Thank you.
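
For reference, a minimal self-contained version of this approach might look like the sketch below; the object name, sample rows, and local SparkContext setup are illustrative assumptions and not part of the original question.

import org.apache.spark.{SparkConf, SparkContext}

object DistinctPerColumn {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-per-column").setMaster("local[*]"))

    // Hypothetical sample data: each Array[String] is one row, columns addressed by position
    val rows = sc.parallelize(Seq(
      Array("a", "x", "1"),
      Array("b", "x", "2"),
      Array("a", "y", "1")))

    // The approach from the question: wrap each value in a Set, then merge the per-column sets pairwise
    val distinctPerColumn: Array[Set[String]] =
      rows.map(_.map(Set(_))).reduce { (a, b) =>
        a.zip(b).map { case (x, y) => x ++ y }
      }

    distinctPerColumn.zipWithIndex.foreach { case (s, i) => println(s"column $i: $s") }
    sc.stop()
  }
}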

Upvotes: 5

Views: 2379

Answers (1)

The Archetypal Paul

Reputation: 41749

Using aggregate saves a step; it may or may not be more efficient:

val z = Array.fill(5)(Set[String]()) // one empty Set per column; use the actual column count
val d = lists.aggregate(z)(
  { (a, b) => a.zip(b).map { case (x, y) => x + y } },   // seqOp: add one row's values to the per-column sets
  { (a, b) => a.zip(b).map { case (x, y) => x ++ y } })  // combOp: merge results from different partitions
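
For readability, the two functions passed to aggregate can be pulled out and given explicit types; this is just a restatement of the lambdas above with type annotations added (the names seqOp and combOp follow Spark's documentation for aggregate, and z and lists are the same values as in the snippet above):

// seqOp folds one row (Array[String]) into the per-column accumulator;
// combOp merges the accumulators produced by different partitions.
val seqOp: (Array[Set[String]], Array[String]) => Array[Set[String]] =
  (acc, row) => acc.zip(row).map { case (s, v) => s + v }
val combOp: (Array[Set[String]], Array[Set[String]]) => Array[Set[String]] =
  (a, b) => a.zip(b).map { case (x, y) => x ++ y }

val d = lists.aggregate(z)(seqOp, combOp)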

You could also try using mutable sets, modifying them in place rather than producing a new set at each step (which Spark explicitly allows):

val z = Array.fill(5)(scala.collection.mutable.Set[String]())
val d = lists.aggregate(z)(
  { (a, b) => a.zip(b).foreach { case (x, y) => x += y }; a },   // add each of the row's values in place, then return the accumulator
  { (a, b) => a.zip(b).foreach { case (x, y) => x ++= y }; a })  // merge the second accumulator into the first in place
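
Putting the mutable-set variant together end to end might look like the sketch below; the sample rows, local SparkContext, and the numCols helper are illustrative assumptions rather than part of the answer. Deriving the zero value's width from the data avoids hard-coding the column count.

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object DistinctPerColumnAggregate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-per-column-aggregate").setMaster("local[*]"))

    val lists = sc.parallelize(Seq(  // hypothetical sample rows
      Array("a", "x", "1"),
      Array("b", "x", "2"),
      Array("a", "y", "1")))

    // Derive the zero value's width from the data instead of hard-coding it
    // (assumes a non-empty RDD; first() runs a small job)
    val numCols = lists.first().length
    val z = Array.fill(numCols)(mutable.Set[String]())

    val d = lists.aggregate(z)(
      { (acc, row) => acc.zip(row).foreach { case (s, v) => s += v }; acc },  // add one row's values in place
      { (a, b) => a.zip(b).foreach { case (x, y) => x ++= y }; a })           // merge per-partition results

    d.zipWithIndex.foreach { case (s, i) => println(s"column $i: ${s.toSet}") }
    sc.stop()
  }
}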

Upvotes: 4
