Reputation: 7
I'm a beginner with Scala.
I've got a DataFrame with two columns:
the first is a date, the second an array of words.
created_at: string
words: array
    element: string
I wish to keep only the words beginning with a '#'.
I would prefer to apply the filter before exploding the array, since most words do not start with a '#'.
I haven't found a way to modify an array column and apply something like filter(_.startsWith("#")) to it.
Is it possible, and how?
Thanks,
Pierre
Upvotes: 0
Views: 794
Reputation: 1085
Try this one:
import org.apache.spark.sql.functions._

df.select(explode(col("words")).as("word"), col("created_at"))  // one row per word
  .where("word LIKE '#%'")                                      // keep only the '#' words
  .groupBy(col("created_at"))
  .agg(collect_set(col("word")).as("words"))                    // rebuild the array per date
  .show
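For reference, a self-contained sketch of the same approach on made-up sample rows (the SparkSession setup and the data are only illustrative):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("hashtag-filter").getOrCreate()
import spark.implicits._

// invented sample data matching the described schema
val df = Seq(
  ("2018-05-01", Seq("a", "#b", "c")),
  ("2018-05-02", Seq("#d", "#e", "f"))
).toDF("created_at", "words")

df.select(explode(col("words")).as("word"), col("created_at"))
  .where("word LIKE '#%'")
  .groupBy(col("created_at"))
  .agg(collect_set(col("word")).as("words"))
  .show
// expected (element order inside each array may vary):
// +----------+--------+
// |created_at|   words|
// +----------+--------+
// |2018-05-01|    [#b]|
// |2018-05-02|[#d, #e]|
// +----------+--------+
Note that collect_set deduplicates the words within each date; use collect_list if duplicates should be kept.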
Upvotes: 0
Reputation: 22449
You can create a simple UDF to keep only the '#'-prefixed words in your array column, without exploding it:
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for toDF and $ (already available in spark-shell)

val df = Seq(
  ("2018-05-01", Seq("a", "#b", "c")),
  ("2018-05-02", Seq("#d", "#e", "f"))
).toDF("created_at", "words")

// keep only the words starting with '#'
def filterArray = udf( (s: Seq[String]) =>
  s.filter(_.startsWith("#"))
)

df.select($"created_at", filterArray($"words")).show
// +----------+----------+
// |created_at|UDF(words)|
// +----------+----------+
// |2018-05-01|      [#b]|
// |2018-05-02|  [#d, #e]|
// +----------+----------+
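If your Spark version is 2.4 or later (an assumption, since this question predates that release), the same result can be had without a UDF and without exploding, using the built-in filter higher-order function on the array column:
import org.apache.spark.sql.functions.expr

// filter(array, lambda) keeps only the elements that satisfy the predicate (Spark 2.4+)
df.withColumn("words", expr("filter(words, w -> w LIKE '#%')")).show(false)
// keeps [#b] for 2018-05-01 and [#d, #e] for 2018-05-02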
Upvotes: 3