vishalraj

Reputation: 115

Collect Spark dataframe column value to set

I need to collect the values of a Spark DataFrame column into a Set so that I can compute its difference with another set. I have the following two DataFrames:

DF1
+----+---------+----------+----+-----------------+
|Lily|Sunflower|Windflower|Rose|Snapdragon Flower|
+----+---------+----------+----+-----------------+
|1   |2        |3         |4   |5                |
+----+---------+----------+----+-----------------+

DF2
+-----------------+
|Flowers          |
+-----------------+
|Rose             |
|Lily             |
|Tulip            |
|Orchid           |
|Snapdragon Flower|
+-----------------+

I want to find the set difference between the column names of DF1 and the values of the Flowers column in DF2. I wrote the following code, but the resulting set difference contains an empty value. CODE:

import sparkSession.sqlContext.implicits._
val df1 = Seq(("1", "2", "3", "4", "5")).toDF("Lily", "Sunflower", "Windflower", "Rose", "Snapdragon Flower")
val df2 = Seq("Rose", "Lily", "Tulip", "Orchid", "Snapdragon Flower").toDF("Flowers")

val set1 = df1.columns.toSet
println(s"set1 => ${set1}")

val flower_values = df2.select("Flowers").collectAsList()
var set2 = Set("") //introduce empty String Type column
for (i <- 0 until flower_values.size()) {
  var col = flower_values.get(i).toString()
  set2 += col.substring(1, col.size - 1)
}
println(s"set2 => ${set2}")

val dif_btw_set2_and_set1 = set2.diff(set1)
println(s"dif_btw_set2_and_set1 => ${dif_btw_set2_and_set1}")

OUTPUT:

set1 => Set(Sunflower, Rose, Windflower, Snapdragon Flower, Lily)
set2 => Set(, Orchid, Rose, Snapdragon Flower, Tulip, Lily)
dif_btw_set2_and_set1 => Set(, Orchid, Tulip)

Can this be done in a more elegant way in Scala-Spark?

Upvotes: 2

Views: 7508

Answers (2)

rajesh

Reputation: 172

I hope this helps. It collects the column's values into a Set, relying on the implicits import already present in your code to supply the String encoder:

val set2 = df2.select("Flowers").as[String].collect().toSet
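For completeness, here is a minimal end-to-end sketch of how that one-liner could replace the loop in the question. The object name FlowerSetDiff, the helper missingFlowers, and the local[*] session are assumptions made for illustration only; the DataFrames are the ones from the question.

```scala
import org.apache.spark.sql.SparkSession

object FlowerSetDiff {
  // Hypothetical helper: returns the values of df2's Flowers column
  // that do not appear among the column names of df1.
  def missingFlowers(spark: SparkSession): Set[String] = {
    import spark.implicits._ // provides toDF and the String encoder used by as[String]

    val df1 = Seq(("1", "2", "3", "4", "5"))
      .toDF("Lily", "Sunflower", "Windflower", "Rose", "Snapdragon Flower")
    val df2 = Seq("Rose", "Lily", "Tulip", "Orchid", "Snapdragon Flower").toDF("Flowers")

    val set1: Set[String] = df1.columns.toSet
    // No Set("") seed and no Row.toString parsing: collect typed values directly.
    val set2: Set[String] = df2.select("Flowers").as[String].collect().toSet

    set2.diff(set1)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for the demo; the question already has one.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flower-set-diff")
      .getOrCreate()
    try println(s"dif_btw_set2_and_set1 => ${missingFlowers(spark)}")
    finally spark.stop()
  }
}
```

With the data from the question, the difference should contain only Tulip and Orchid, with no empty string. Note that collect() brings every row to the driver, so on a large column it may be worth calling distinct() before collecting.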

Upvotes: 5

aax

Reputation: 454

This will give you the values of the second DataFrame as a set of strings:

import scala.collection.JavaConverters._
val flower_values = df2.select("Flowers").collectAsList()
val set2 = flower_values.asScala.map(_.getString(0)).toSet

collectAsList returns a java.util.List[Row], which has no Scala map method, so asScala (from JavaConverters) converts it to a Scala collection first. getString(0) takes the string value from the first column of a row, map applies that to every row, and toSet converts the result to a Set[String].

Note: Set("") creates a set with one element (the empty string), which is why an empty value shows up in your set difference. Set() creates an empty set. Either way, creating a set up front like this is not necessary for this example.
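The difference between the two can be checked without Spark. The following is a small sketch in plain Scala; the object name EmptySetDemo and the shortened sample values are made up for illustration.

```scala
object EmptySetDemo {
  val seeded: Set[String] = Set("")           // one element: the empty string
  val empty: Set[String] = Set.empty[String]  // no elements at all

  // Shortened stand-ins for the column names and the Flowers values.
  val columns: Set[String] = Set("Lily", "Rose")
  val flowers: Seq[String] = Seq("Rose", "Tulip")

  // The empty string survives the diff because it is not a column name.
  val fromSeeded: Set[String] = (seeded ++ flowers).diff(columns)
  // Starting from a truly empty set leaves only the real difference.
  val fromEmpty: Set[String] = (empty ++ flowers).diff(columns)

  def main(args: Array[String]): Unit = {
    println(s"seeded size => ${seeded.size}, empty size => ${empty.size}")
    println(s"fromSeeded => $fromSeeded")
    println(s"fromEmpty => $fromEmpty")
  }
}
```

Here fromSeeded holds the empty string plus Tulip, while fromEmpty holds only Tulip, which mirrors the stray value seen in the question's output.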

Upvotes: 0
