Reputation: 483
To find the distinct values in each column of an RDD[Array[String]], I tried:
// rdd: RDD[Array[String]], all rows having the same length
rdd.map(_.map(Set(_)))
   .reduce { (a, b) => a.zip(b).map { case (x, y) => x ++ y } }
which executes successfully. However, I'd like to know if there is a more efficient way of doing this than my sample code above. Thank you.
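For concreteness, here's a minimal runnable sketch of the above (the rdd and sc names and the two-column sample rows are illustrative assumptions, not part of my real data):

import org.apache.spark.rdd.RDD

// Hypothetical input: each Array is one row; columns align by index
val rdd: RDD[Array[String]] = sc.parallelize(Seq(
  Array("a", "x"),
  Array("b", "x"),
  Array("a", "y")
))
rdd.map(_.map(Set(_)))
   .reduce { (a, b) => a.zip(b).map { case (x, y) => x ++ y } }
// result: Array(Set(a, b), Set(x, y))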
Upvotes: 5
Views: 2379
Reputation: 41749
Using aggregate saves a step, since it folds each raw row directly into the accumulator instead of first mapping every row to an array of singleton sets; it might or might not be more efficient:
val z = Array.fill(5)(Set[String]()) // or whatever the row length is
val d = lists.aggregate(z)(
  (a, b) => a.zip(b).map { case (x, y) => x + y },   // seqOp: add each cell to its column's set
  (a, b) => a.zip(b).map { case (x, y) => x ++ y }   // combOp: union the per-partition results
)
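For reference, the relevant signature from the RDD API is

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Here U is Array[Set[String]] and T is Array[String]: seqOp folds one raw row into a per-partition accumulator, and combOp merges two accumulators, so the intermediate RDD[Array[Set[String]]] from the map/reduce version never materializes.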
You could also try using mutable sets, mutating them in place rather than producing a new set at each step (mutating the accumulator in aggregate is explicitly allowed by Spark):
val z = Array.fill(5)(scala.collection.mutable.Set[String]())
val d = lists.aggregate(z)(
  (a, b) => { a.zip(b).foreach { case (x, y) => x += y }; a },  // mutate in place, return the accumulator
  (a, b) => { a.zip(b).foreach { case (x, y) => x ++= y }; a }  // merge in place, return the accumulator
)
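If downstream code expects immutable sets, the mutable result converts in one pass afterwards (a sketch; immutableD is an illustrative name):

val immutableD: Array[Set[String]] = d.map(_.toSet)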
Upvotes: 4