Marsellus Wallace

Reputation: 18601

Filter a Spark/Scala DataFrame if a column's value is present in a set

I'm using Spark 1.4.0; this is what I have so far:

data.filter($"myColumn".in(lit("A"), lit("B"), lit("C"), ...))

The function lit converts a literal to a column.

Ideally I would put my A, B, C in a Set and check like this:

val validValues = Set("A", "B", "C", ...)
data.filter($"myColumn".in(validValues))

What's the correct syntax? Are there any alternative concise solutions?

Upvotes: 5

Views: 10682

Answers (2)

DB Tsai

Reputation: 1398

This PR has been merged into Spark 2.4, so you can now do:

import spark.implicits._  // for toDF and the $"..." column syntax (already in scope in spark-shell)

val profileDF = Seq(
  Some(1), Some(2), Some(3), Some(4),
  Some(5), Some(6), Some(7), None
).toDF("profileID")

val validUsers: Set[Any] = Set(6, 7.toShort, 8L, "3")

val result = profileDF.withColumn("isValid", $"profileID".isInCollection(validUsers))

result.show(10)
"""
+---------+-------+
|profileID|isValid|
+---------+-------+
|        1|  false|
|        2|  false|
|        3|   true|
|        4|  false|
|        5|  false|
|        6|   true|
|        7|   true|
|     null|   null|
+---------+-------+
 """.stripMargin

Upvotes: 6

Paweł Jurczenko

Reputation: 4471

Spark 1.4 or older:

import org.apache.spark.sql.functions.lit

val validValues = Set("A", "B", "C").map(lit(_))
data.filter($"myColumn".in(validValues.toSeq: _*))

Spark 1.5 or newer:

val validValues = Set("A", "B", "C")
data.filter($"myColumn".isin(validValues.toSeq: _*))
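
Putting it together, a self-contained sketch assuming a Spark 2.x SparkSession and a stand-in DataFrame for the question's data (the column name comes from the question; the sample rows are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("isin-example").getOrCreate()
import spark.implicits._

// Stand-in for the question's `data`; only the A and B rows should survive the filter.
val data = Seq("A", "B", "D", "E").toDF("myColumn")

val validValues = Set("A", "B", "C")

// isin takes Any*, so the Set is converted to a Seq and expanded with `: _*`.
data.filter($"myColumn".isin(validValues.toSeq: _*)).show()
// +--------+
// |myColumn|
// +--------+
// |       A|
// |       B|
// +--------+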

Upvotes: 11
