Reputation: 896
New to Spark.
I'd like to do some transformation on the "wordList" column of a Spark DataFrame, df, of type org.apache.spark.sql.DataFrame = [id: string, wordList: array<string>].
I use Databricks. df looks like:
+--------------------+--------------------+
| id| wordList|
+--------------------+--------------------+
|08b0a9b6-3b9a-47a...| [a]|
|23c2ef79-8dce-4ad...|[ag, adfg, asdfgg...|
|26a7682f-2ce6-4eb...|[ghe, gener, ghee...|
|2ab530b5-04bc-463...|[bap, pemm, pava,...|
+--------------------+--------------------+
More specifically, I have defined a function shrinkList(ol: List[String]): List[String] that takes a list and returns a shorter list, and I would like to apply it to the wordList column. The question is, how do I convert the row to a list?
df.select("wordList").map(t => shrinkList(t(1)))
gives the error: type mismatch;
found : Any
required: List[String]
Also, I'm not sure about "t(1)" here. I'd rather use the column name instead of the index, in case the order of the columns changes in the future. But I can't seem to make t$"wordList" or t.wordList or t("wordList") work. So instead of using t(1), what selector can I use to select the "wordList" column?
Upvotes: 0
Views: 771
Reputation:
Try:
df.select("wordList").map(t => shrinkList(t.getSeq[String](0).toList))
or
df.select("wordList").map(t => shrinkList(t.getAs[Seq[String]]("wordList").toList))
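getSeq[String](0) reads the column by position (0, because the select produces a single-column Row), while getAs[Seq[String]]("wordList") reads it by name, which is the selector you were asking about. Your original error comes from t(1): Row.apply returns Any, which the compiler can't match against List[String]. Also note that map on a DataFrame produces a Dataset, so the implicit encoders (import spark.implicits._) need to be in scope.

An alternative that avoids the map/Encoder machinery entirely is to wrap the function in a UDF and keep working with DataFrame columns. A minimal sketch, assuming a stand-in shrinkList (yours may differ):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for your shrinkList: keeps only words longer than 2 characters.
def shrinkList(ol: List[String]): List[String] = ol.filter(_.length > 2)

// Spark hands an array<string> column to a UDF as Seq[String].
val shrinkListUdf = udf((words: Seq[String]) => shrinkList(words.toList))

// Overwrite wordList with the shrunk version, keeping id intact.
val shrunk = df.withColumn("wordList", shrinkListUdf(col("wordList")))
shrunk.show(truncate = false)

This also sidesteps the column-order concern, since the UDF is applied to the column by name.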
Upvotes: 1