Reputation: 2737
My PySpark DataFrame looks like this:
|-- name: string (nullable = true)
|-- other_attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I am looking for the rows that don't have [Closed, Yes] in their array of structs under other_attr. The other_attr column is an array of structs and may be an empty array. How can I run this filter?
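For reference, a minimal DataFrame with this schema can be built as follows (a sketch; the sample rows are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the schema above; note the empty array in the last row
df = spark.createDataFrame(
    [
        ("test1", [("Closed", "Yes")]),
        ("test2", [("Closed", "No")]),
        ("test3", []),
    ],
    "name string, other_attr array<struct<key:string, value:string>>",
)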
Upvotes: 1
Views: 1764
Reputation: 159
If you want to write it with Spark SQL, that is also possible: since Spark 2.4.0 you can use the higher-order function exists.
Example with Spark SQL:
SELECT exists(
    array(named_struct('key', 'a', 'value', '1'), named_struct('key', 'b', 'value', '2')),
    x -> x = named_struct('key', 'a', 'value', '1')
)
Example with PySpark, applied to the question's filter:
df.filter("NOT exists(other_attr, x -> x = named_struct('key', 'Closed', 'value', 'Yes'))")
Note that not all the functions that manipulate arrays start with array_*, e.g. exists, filter, size, ...
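For completeness, here is a sketch of the same filter with the DataFrame API (pyspark.sql.functions.exists, available since Spark 3.1; column and field names are taken from the question's schema):
import pyspark.sql.functions as F

# Keep rows whose array does NOT contain a {Closed, Yes} element.
# exists() returns false for an empty array, so such rows are kept.
filtered = df.filter(
    ~F.exists(
        'other_attr',
        lambda x: (x['key'] == 'Closed') & (x['value'] == 'Yes'),
    )
)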
Upvotes: 0
Reputation: 6082
You can simply use array_contains to check against the struct [Closed, Yes], like so:
import pyspark.sql.functions as F
df.show()
# +-----+---------------+
# | name| other_attr|
# +-----+---------------+
# |test1|[{Closed, Yes}]|
# |test2| [{Closed, No}]|
# |test3| []|
# +-----+---------------+
df.where(
    ~F.array_contains('other_attr', F.struct(
        F.lit('Closed').alias('key'),
        F.lit('Yes').alias('value'),
    ))
).show()
# Output
# +-----+--------------+
# | name| other_attr|
# +-----+--------------+
# |test2|[{Closed, No}]|
# |test3| []|
# +-----+--------------+
Upvotes: 2
Reputation: 3006
You can use the to_json function together with contains to filter rows based on the serialized struct:
import pyspark.sql.functions as F
df2 = df.filter(
    ~F.to_json('other_attr').contains(
        F.to_json(
            F.struct(
                F.lit('Closed').alias('key'),
                F.lit('Yes').alias('value'),
            )
        )
    )
)
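Note that this works by substring matching on the serialized JSON, so it assumes the fragment {"key":"Closed","value":"Yes"} can only occur when the array actually contains that struct. With the sample data shown above, df2 keeps test2 and test3, the same result as the array_contains approach.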
Upvotes: 1