Brian

Reputation: 976

Filter or label rows based on a Scala array

Is there a way to filter or label rows based on a Scala array?

Please keep in mind that in reality the number of rows is much larger.

sample data

val clients = List(List("1", "67"), List("2", "77"), List("3", "56"), List("4", "90")).map(x => (x(0), x(1)))
val df = clients.toDF("soc","ages")

+---+----+
|soc|ages|
+---+----+
|  1|  67|
|  2|  77|
|  3|  56|
|  4|  90|
| ..|  ..|
+---+----+

I would like to filter all the rows whose ages are in a Scala array, let's say:

var z = Array(90, 56, 67)
df.where($"ages" IN z)   // pseudocode: the behavior I am after

or

df.withColumn("flag", when($"ages" >= 30, 1)
              .otherwise(when($"ages" <= 5, 2)
                .otherwise(3)))

Upvotes: 2

Views: 122

Answers (2)

notNull

Reputation: 31510

You can pass each element of the array as a separate argument by using the `_*` operator.

Then write a when/otherwise expression using isin.

Example:

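import org.apache.spark.sql.functions.when
import spark.implicits._  // already in scope in spark-shell

val df1 = Seq((1, 67), (2, 77), (3, 56), (4, 90)).toDF("soc", "ages")
val z = Array(90, 56, 67)
// z: _* expands the array into the varargs that isin expects
df1.withColumn("flag",
    when($"ages".isin(z: _*), "in Z array")
      .otherwise("not in Z array"))
  .show(false)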
+---+----+--------------+
|soc|ages|flag          |
+---+----+--------------+
|1  |67  |in Z array    |
|2  |77  |not in Z array|
|3  |56  |in Z array    |
|4  |90  |in Z array    |
+---+----+--------------+
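
The question also asks about filtering rather than labelling. The same isin condition can be passed to filter (or where). The following is a minimal sketch, assuming a local SparkSession created just for illustration (in spark-shell, `spark` already exists); the names `inZ` and `notInZ` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("isin-filter-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 67), (2, 77), (3, 56), (4, 90)).toDF("soc", "ages")
val z = Array(90, 56, 67)

// Keep only the rows whose age appears in z; z: _* expands the array into varargs.
val inZ = df1.filter($"ages".isin(z: _*))

// The complement: rows whose age is not in z.
val notInZ = df1.filter(!($"ages".isin(z: _*)))

inZ.show(false)
```

With the sample data, `inZ` holds the rows for soc 1, 3 and 4, and `notInZ` holds the row for soc 2.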

Upvotes: 4

C.S.Reddy Gadipally

Reputation: 1758

One option is a UDF.

scala> val df1 = Seq((1, 67), (2, 77), (3, 56), (4, 90)).toDF("soc", "ages")
df1: org.apache.spark.sql.DataFrame = [soc: int, ages: int]

scala> df1.show
+---+----+
|soc|ages|
+---+----+
|  1|  67|
|  2|  77|
|  3|  56|
|  4|  90|
+---+----+


scala> val scalaAgesArray = Array(90, 56,67)
scalaAgesArray: Array[Int] = Array(90, 56, 67)

scala> val containsAgeUdf = udf((x: Int) => scalaAgesArray.contains(x))
containsAgeUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))

scala> val outputDF = df1.withColumn("flag", containsAgeUdf($"ages"))
outputDF: org.apache.spark.sql.DataFrame = [soc: int, ages: int ... 1 more field]

scala> outputDF.show(false)
+---+----+-----+
|soc|ages|flag |
+---+----+-----+
|1  |67  |true |
|2  |77  |false|
|3  |56  |true |
|4  |90  |true |
+---+----+-----+
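
If the UDF route is taken with a longer list of values, one possible variation is to convert the array to a Set first, so each lookup is a hash check instead of a linear scan, and to use the Boolean result directly as a filter condition. This is only a sketch: the local SparkSession and the names `agesSet`, `inSetUdf` and `matched` are assumptions, not part of the answer above. Note that a UDF is opaque to Spark's optimizer, so a built-in expression is usually worth trying first.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("udf-set-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 67), (2, 77), (3, 56), (4, 90)).toDF("soc", "ages")

// Hypothetical names: agesSet and inSetUdf are not from the original answer.
val agesSet: Set[Int] = Array(90, 56, 67).toSet
val inSetUdf = udf((age: Int) => agesSet.contains(age))

// The Boolean UDF result can be used directly as a filter condition.
val matched = df1.filter(inSetUdf($"ages"))
matched.show(false)
```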

Upvotes: 3
