verojoucla

Reputation: 649

How to filter Array column values in PySpark

I have a PySpark DataFrame with many columns, among them an Array-type column and a String column:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["160001","9867", "42345"]    |     C
------------------------------|----------------
["160001","8650", "2345"]     |     A
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

I want to remove the numbers that contain exactly 4 digits from the numbers column if the name column is not "B", and keep them if the name column is "B". For example:

In rows 2 and 5, "1600" and "2456" contain 4 digits and the name column is "B", so I should keep them in the column values:

------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

In rows 3 and 4, the numbers column contains 4-digit numbers but the name column is not "B", so I should remove them.

Example:

------------------------------|----------------
["160001","9867", "42345"]    |     C
------------------------------|----------------
["160001","8650", "2345"]     |     A
------------------------------|----------------

Expected result:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600", "42345"]    |     B
------------------------------|----------------
["160001", "42345"]           |     C
------------------------------|----------------
["160001"]                    |     A
------------------------------|----------------
["2456","78568", "42345"]     |     B
-----------------------------------------------

How can I do it? Thank you.

Upvotes: 1

Views: 4290

Answers (2)

David Vrba

Reputation: 3344

Since Spark 2.4 you can use the higher-order function FILTER to filter the array. Combining it with an if expression should solve the problem:

df.selectExpr("if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers", "name")
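
For reference, a minimal sketch that builds the sample data from the question and applies the expression (assumes an active SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(["160001", "160021"], "A"),
     (["160001", "1600", "42345"], "B"),
     (["160001", "9867", "42345"], "C"),
     (["160001", "8650", "2345"], "A"),
     (["2456", "78568", "42345"], "B")],
    "numbers array<string>, name string",
)

# keep 4-digit entries only for rows where name is "B"
df.selectExpr(
    "if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers",
    "name",
).show(truncate=False)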

Upvotes: 2

pissall

Reputation: 7399

You can write a udf to do the array filtering and combine it with a when clause, so the udf is only applied on rows where name is not "B":

from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import ArrayType, StringType

# drop entries that have exactly 4 characters
filter_array_udf = udf(lambda arr: [x for x in arr if len(x) != 4], "array<string>")
# if the string schema is not supported, use the next commented line instead
# filter_array_udf = udf(lambda arr: [x for x in arr if len(x) != 4], ArrayType(StringType()))

df = df.withColumn("numbers", when(col("name") != "B", filter_array_udf(col("numbers"))).otherwise(col("numbers")))
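
If you are on Spark 3.1 or later, the same logic can be expressed without a Python udf via pyspark.sql.functions.filter, which avoids the Python serialization overhead (a sketch under that version assumption):

from pyspark.sql import functions as F

# native equivalent: drop 4-character entries unless name is "B"
df = df.withColumn(
    "numbers",
    F.when(
        F.col("name") != "B",
        F.filter(F.col("numbers"), lambda x: F.length(x) != 4),
    ).otherwise(F.col("numbers")),
)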

Upvotes: 1
