Reputation: 649
I have a PySpark DataFrame with many columns; among them are an array column and a string column:
numbers <Array>             | name <String>
----------------------------|--------------
["160001","160021"]         | A
["160001","1600","42345"]   | B
["160001","9867","42345"]   | C
["160001","8650","2345"]    | A
["2456","78568","42345"]    | B
I want to remove the 4-digit numbers from the numbers column when the name column is not "B", and keep them when name is "B".
For example:
In rows 2 and 5, "1600" and "2456" have 4 digits and name is "B", so I should keep them:
----------------------------|--------------
["160001","1600","42345"]   | B
["2456","78568","42345"]    | B
In rows 3 and 4, the numbers column contains a 4-digit number but name is not "B", so I should remove it:
----------------------------|--------------
["160001","9867","42345"]   | C
["160001","8650","2345"]    | A
Expected result:
numbers <Array>             | name <String>
----------------------------|--------------
["160001","160021"]         | A
["160001","1600","42345"]   | B
["160001","42345"]          | C
["160001"]                  | A
["2456","78568","42345"]    | B
How can I do it? Thank you.
Upvotes: 1
Views: 4290
Reputation: 3344
Since Spark 2.4 you can use the higher-order function FILTER to filter the array. Combining this with an if expression should solve the problem:
df.selectExpr("if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers", "name")
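As a quick sanity check (plain Python, no Spark needed), the same per-row rule that the expression encodes can be applied to the sample data; the helper name `clean_numbers` is just for illustration:

```python
def clean_numbers(numbers, name):
    # Mirror of: if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers)
    if name != "B":
        return [x for x in numbers if len(x) != 4]
    return numbers

rows = [
    (["160001", "160021"], "A"),
    (["160001", "1600", "42345"], "B"),
    (["160001", "9867", "42345"], "C"),
    (["160001", "8650", "2345"], "A"),
    (["2456", "78568", "42345"], "B"),
]
result = [(clean_numbers(nums, name), name) for nums, name in rows]
# result matches the expected table in the question
```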
Upvotes: 2
Reputation: 7399
You can write a udf to do the array filtering and combine it with a when clause so that it is applied only when name is not "B":
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import ArrayType, StringType

# Drop entries that are exactly 4 characters long
filter_array = lambda arr: [x for x in arr if len(x) != 4]
filter_array_udf = udf(filter_array, "array<string>")
# if the string schema is not supported, use the next commented line instead
# filter_array_udf = udf(filter_array, ArrayType(StringType()))

df = df.withColumn(
    "numbers",
    when(col("name") != "B", filter_array_udf(col("numbers"))).otherwise(col("numbers")),
)
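The udf's lambda can be sanity-checked outside Spark with plain Python on the sample rows (purely illustrative):

```python
# Same lambda as the udf body: drop entries that are exactly 4 characters long
filter_array = lambda arr: [x for x in arr if len(x) != 4]

print(filter_array(["160001", "8650", "2345"]))   # -> ['160001']
print(filter_array(["160001", "9867", "42345"]))  # -> ['160001', '42345']
```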
Upvotes: 1