Reputation: 2441
Spark 2.4 introduced useful new Spark SQL functions involving arrays, but I was a bit puzzled when I found out that the result of
select array_remove(array(1, 2, 3, null, 3), null)
is null and not [1, 2, 3, 3].
Is this the expected behavior? Is it possible to remove nulls using array_remove?
As a side note, the alternative I am using for now is a higher-order function in Databricks:
select filter(array(1, 2, 3, null, 3), x -> x is not null)
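For reference, a rough PySpark sketch of that same workaround (the DataFrame df and its array column c are just illustrative); in Spark 2.4 the higher-order filter function is only exposed in SQL, so it has to go through expr:
from pyspark.sql import functions as F

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([([1, 2, 3, None, 3],)], ["c"])

# filter() is SQL-only in 2.4, so wrap it in expr()
df.select(F.expr("filter(c, x -> x is not null)").alias("c")).show()
# expected result: [1, 2, 3, 3]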
Upvotes: 23
Views: 25248
Reputation: 24386
Spark 3.4+
array_compact("col_name")
Full PySpark example:
from pyspark.sql import functions as F
df = spark.createDataFrame([([3, None, 3],)], ["c"])
df.show()
# +------------+
# | c|
# +------------+
# |[3, null, 3]|
# +------------+
df = df.withColumn("c", F.array_compact("c"))
df.show()
# +------+
# | c|
# +------+
# |[3, 3]|
# +------+
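Since the question used Spark SQL, the same function can also be called in SQL (likewise Spark 3.4+; assumes an active SparkSession named `spark`):
spark.sql("SELECT array_compact(array(1, 2, 3, NULL, 3)) AS c").show()
# expected result: [1, 2, 3, 3]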
Upvotes: 17
Reputation: 812
To answer your first question, "Is this the expected behavior?": yes. The official notebook (https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) states "Remove all elements that equal to the given element from the given array.", and NULL corresponds to an undefined value, so the result of comparing against it is also undefined.
So I think NULLs are outside the purview of this function.
It is good that you already found a way to overcome this; you can also use
spark.sql("""SELECT array_except(array(1, 2, 3, 3, null, 3, 3, 3, 4, 5), array(null))""").show()
but the downside is that the result will also be stripped of duplicates.
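To make that downside concrete, a minimal sketch of the result (assuming an active SparkSession named `spark`; the literal values are just an example):
spark.sql("SELECT array_except(array(1, 2, 3, 3, NULL, 3, 4, 5), array(NULL)) AS c").show(truncate=False)
# expected result: [1, 2, 3, 4, 5] -- the null is removed, but the repeated 3s collapse to a single 3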
Upvotes: 18
Reputation: 808
If you also want to get rid of duplicates, returning each distinct non-NULL value exactly once, you can use array_except:
f.array_except(f.col("array_column_with_nulls"), f.array(f.lit(None)))
or, equivalently, SQL like this:
array_except(your_array_with_NULLs, array(NULL))
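A minimal end-to-end sketch of the DataFrame-API form (the DataFrame df and the column names are illustrative, and f is the usual pyspark.sql.functions alias):
from pyspark.sql import functions as f

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([([1, 2, 2, None, 3],)], ["array_column_with_nulls"])
df = df.withColumn(
    "array_column_without_nulls",
    f.array_except(f.col("array_column_with_nulls"), f.array(f.lit(None))),
)
df.show(truncate=False)
# expected result: [1, 2, 3] -- each distinct non-null value exactly once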
Upvotes: -1
Reputation: 2472
I don't think you can use array_remove() or array_except() for your problem. The following is not a very elegant solution, but it may help.
from pyspark.sql import functions as F

# the return type of the UDF must match the element type of your array column
@F.udf("array<string>")
def udf_remove_nulls(arr):
    return [i for i in arr if i is not None]

df = df.withColumn("col_wo_nulls", udf_remove_nulls(df["array_column"]))
Upvotes: -1
Reputation: 33449
You can do something like this in Spark 2:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
/**
* Array without nulls
* For complex types, you are responsible for passing in a nullPlaceholder of the same type as elements in the array
*/
def non_null_array(columns: Seq[Column], nullPlaceholder: Any = "רכוב כל יום"): Column =
array_remove(array(columns.map(c => coalesce(c, lit(nullPlaceholder))): _*), nullPlaceholder)
In Spark 3, there is a new filter function for arrays, and you can do:
df.select(filter(col("array_column"), x => x.isNotNull))
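A rough PySpark counterpart, in case it is useful (functions.filter is available from Spark 3.1; the DataFrame and column name are illustrative):
from pyspark.sql import functions as F

df.select(F.filter(F.col("array_column"), lambda x: x.isNotNull()))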
Upvotes: 8
Reputation: 184
https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html
array_remove(array, T): array
Remove all elements that equal to the given element from the given array.
Note: I only referred to the documentation, and it uses the same data. null can never be equal to null.
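To illustrate that last point, a minimal sketch (assuming an active SparkSession named `spark`):
# a comparison with NULL yields NULL (unknown), never true,
# so array_remove(..., null) has nothing it can match
spark.sql("SELECT NULL = NULL AS null_eq_null, 3 = NULL AS three_eq_null").show()
# both columns come back as null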
Upvotes: 7