Chaouki

Reputation: 465

Filter a DataFrame based on an element from an array column

I'm working with a dataframe with the following schema:

    root
     |-- c: long (nullable = true)
     |-- data: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- key: string (nullable = true)
     |    |    |-- value: string (nullable = true)

I'm trying to filter this dataframe based on an element ["value1", "key1"] in the array column data, i.e. if this element exists in a row's data then keep that row, otherwise drop it. I tried

df.filter(col("data").contains("[value1, key1]"))

but it didn't work. I also tried val f = Array("value1", "key1") and then df.filter(col("data").contains(f)), but that didn't work either.

Any help please?

Upvotes: 0

Views: 134

Answers (1)

Ramesh Maharjan

Reputation: 41957

A straightforward approach is to use a udf function, since a udf lets you apply the logic row by row on primitive datatypes (which is what your requirement calls for: checking the key and value of every struct element in the array column data).

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// udf to check for key1 in key and value1 in value of every struct in the array field
def containsUdf = udf((data: Seq[Row]) =>
  data.exists(row => row.getAs[String]("key") == "key1" && row.getAs[String]("value") == "value1"))

// calling the udf function in the filter
val filteredDF = df.filter(containsUdf(col("data")))

The resulting filteredDF should be your desired output.
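For completeness, here is a minimal self-contained sketch of the same approach run against sample data shaped like your schema (the sample rows and the local SparkSession setup are made up for illustration):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("filter-by-array-element").getOrCreate()
import spark.implicits._

// sample rows matching the question's schema; the values are hypothetical
case class KV(key: String, value: String)
val df = Seq(
  (1L, Seq(KV("key1", "value1"), KV("key2", "value2"))),
  (2L, Seq(KV("key2", "value2")))
).toDF("c", "data")

// struct elements arrive in a udf as Row, so the udf above applies unchanged
val containsUdf = udf((data: Seq[Row]) =>
  data.exists(row => row.getAs[String]("key") == "key1" && row.getAs[String]("value") == "value1"))

df.filter(containsUdf(col("data"))).show(false)
// keeps only the row with c = 1

As a side note, if you are on Spark 2.4 or later, the same filter can be expressed without a udf through the SQL higher-order function exists, e.g. df.filter(expr("exists(data, x -> x.key = 'key1' AND x.value = 'value1')")), which avoids the serialization overhead of a udf.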

Upvotes: 4
