Ged

Reputation: 18053

size function applied to empty array column in dataframe returns 1 after split

I noticed the following behavior of the size function on an array column in a dataframe, using this code, which includes a split:

import org.apache.spark.sql.functions.{trim, explode, split, size}

val df1 = Seq(
  (1, "[{a},{b},{c}]"),
  (2, "[]"),
  (3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)

val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)

we get:

+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |1  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+

I was hoping for a zero, so as to be able to distinguish between 0 and 1 entries.

There are a few hints here and there on SO, but none that helped.

If I instead have the entry (2, null), then I get size -1, which is more helpful, I guess.
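
For reference, a minimal sketch of that null case (dfNull is just an illustrative name; same imports as above):

val dfNull = Seq((1, "[{a},{b},{c}]"), (2, null)).toDF("col1", "col2")
dfNull.withColumn("cola", split(trim($"col2", "[]"), ","))
      .withColumn("s", size($"cola"))
      .show(false)
// row 2: cola is null and s is -1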

On the other hand, this borrowed sample from the internet:

import org.apache.spark.sql.functions.{array, coalesce, size}

val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
// replace null arrays with empty arrays before taking the size
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()

does return 0, as expected.

I am looking for the correct approach here, so as to get size = 0.

Upvotes: 0

Views: 1617

Answers (2)

blackbishop

Reputation: 32680

This behavior is inherited from the Java split function, which Scala and Spark use in the same way. Empty input is a special case, and this is well discussed in this SO post.

Spark sets the default value of the split function's second parameter (limit) to -1. As of Spark 3, you can pass a limit parameter to the split function yourself.

You can see this by comparing the Scala split function with the Spark SQL split function:

"".split(",").length
//res31: Int = 1

spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//|                     1|
//+----------------------+

And

",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0

",".split(",", -1).length
//res34: Int = 2

spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//|                      2|
//+-----------------------+
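
For completeness, a sketch of the three-argument split on the DataFrame API, assuming Spark 3.x:

import org.apache.spark.sql.functions.{size, split}

val df = Seq((1, "a,b,c"), (2, ""), (3, ",")).toDF("id", "csv")
df.withColumn("parts", split($"csv", ",", -1)) // explicit limit = -1
  .withColumn("s", size($"parts"))
  .show(false)
// id 2: "" splits to [""], so s = 1
// id 3: "," splits to ["", ""], so s = 2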

Upvotes: 1

mck

Reputation: 42392

I suppose the root cause is that split returns an array containing an empty string, rather than a null or empty array.

scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""

And the size of an array containing an empty string is, of course, 1.

To get around this, perhaps you could do the following:

import org.apache.spark.sql.functions.{length, lit, when}

// if the first element is an empty string, the source array was empty
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
             .withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
                              .otherwise(lit(0)))

df2.show(false)
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |0  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
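
Alternatively, assuming Spark 2.4+ where array_remove is available, you could drop the empty string before taking the size (df2b is just an illustrative name):

import org.apache.spark.sql.functions.{array_remove, size, split, trim}

val df2b = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
              .withColumn("s", size(array_remove($"cola", "")))
df2b.show(false)
// row 2: cola is [""], array_remove drops the "", so s = 0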

Upvotes: 0
