Reihan_amn
Reihan_amn

Reputation: 2737

Filtering records in pyspark dataframe if the struct Array contains a record

My Pyspark dataframe looks like this:

|-- name: string (nullable = true)
 |-- other_attr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

I am looking for the rows that don't have [Closed, Yes] in their array of struct under other_attr. The other_attr is an array of struct which could be an empty array. How could I run this filtering?

Upvotes: 1

Views: 1764

Answers (3)

programort
programort

Reputation: 159

If you want to code it with Spark SQL it is also possible:
Since Spark 2.4.0 you can use the functions exist.

Example with SparkSQL:

SELECT
    EXISTS
    (
        ARRAY(named_struct("key": "a", "value": "1"), named_struct("key": "b", "value": "2")),
        x -> x = named_struct("key": "a", "value": "1")
    )

Example with PySpark:

df.filter('exists(extra_features, x -> x = named_struct("key": "a", "value": "1"))')

Note that not all the functions to manipulate arrays start with array_*.
Ex: exist, filter, size, ...

Upvotes: 0

pltc
pltc

Reputation: 6082

You can simply use array_contains to check against the struct [Closed, Yes] like so

import pyspark.sql.functions as F

df.show()
# +-----+---------------+
# | name|     other_attr|
# +-----+---------------+
# |test1|[{Closed, Yes}]|
# |test2| [{Closed, No}]|
# |test3|             []|
# +-----+---------------+

(df.where(~F
        .array_contains('other_attr', F.struct(
            F.lit('Closed').alias('key'),
            F.lit('Yes').alias('value'),
        ))
    ).show()
)

# Output
# +-----+--------------+
# | name|    other_attr|
# +-----+--------------+
# |test2|[{Closed, No}]|
# |test3|            []|
# +-----+--------------+

Upvotes: 2

Drashti Dobariya
Drashti Dobariya

Reputation: 3006

You can use to_json function with contains to filter rows based on criteria.

import pyspark.sql.functions as F

df2 = df.filter(
    ~F.to_json('other_attr').contains(
        F.to_json(
            F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value')
            )
        )
    )
)

Upvotes: 1

Related Questions