Reputation: 624
I have a DataFrame like this:
val df2 = spark.createDataFrame(
  Seq(
    (0, "this is a sentence"),
    (1, "And another sentence")
  )
).toDF("num", "words")
and I would like to get the distinct words in the words column, like this:
val vocab = List("this", "is", "a", "sentence", "And", "another")
What is an idiomatic Scala/Spark way of achieving this?
PS: I know I could hack away at this with for loops and such, but I am trying to get better at functional programming, and more specifically at Spark and Scala.
Upvotes: 0
Views: 144
Reputation: 21
Here is a very silly answer:
import spark.implicits._
df2
  .as[(Int, String)]
  .flatMap { case (_, words) => words.split(' ') }
  .distinct
  .show(false)
I think this is what you want?
+--------+
|value   |
+--------+
|sentence|
|this    |
|is      |
|a       |
|And     |
|another |
+--------+
Or were you more after a single row that contains all the distinct words?
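If a local list (like the vocab in the question) or a single row is what you need, here is a minimal sketch of both. It is written script-style for spark-shell or a notebook. The local SparkSession setup, the column alias word, and the result names vocab and singleRow are my own assumptions, not something from the question. The order of the words is not guaranteed in either variant.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_set, explode, split}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vocab-sketch")
  .getOrCreate()
import spark.implicits._

val df2 = Seq(
  (0, "this is a sentence"),
  (1, "And another sentence")
).toDF("num", "words")

// One row per word: split on runs of whitespace, then explode the array.
val words = df2.select(explode(split(col("words"), "\\s+")).as("word"))

// Variant 1: bring the distinct words back to the driver as a local List.
// Only sensible when the vocabulary fits comfortably in driver memory.
val vocab: List[String] = words.distinct().as[String].collect().toList

// Variant 2: keep the result in Spark as a single row holding an array
// of the distinct words.
val singleRow = words.agg(collect_set("word").as("vocab"))
singleRow.show(false)
```

Variant 1 matches the List in the question; variant 2 avoids pulling the data to the driver if the next step is another Spark transformation.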
(also this is my first ever stack overflow answer so pls be nice <3)
Upvotes: 1