Ged

Reputation: 18053

size function applied to empty array column in dataframe returns 1 after split

I noticed the following behavior of the size function on an array column in a dataframe, using this code, which includes a split:

import org.apache.spark.sql.functions.{trim, explode, split, size}

val df1 = Seq(
  (1, "[{a},{b},{c}]"),
  (2, "[]"),
  (3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)

val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)

we get:

+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |1  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+

I was hoping for a zero, so as to be able to distinguish between 0 and 1 entries.

There are a few hints here and there on SO, but none that helped.

If I instead have the entry (2, null), then I get size -1, which is more helpful, I guess.
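
For reference, a minimal sketch of that null case (dfNull is just an illustrative name; same imports as above):

val dfNull = Seq((1, "[{a},{b},{c}]"), (2, null)).toDF("col1", "col2")
dfNull.withColumn("cola", split(trim($"col2", "[]"), ","))
      .withColumn("s", size($"cola"))
      .show(false)
// row 2: cola is null and s is -1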

On the other hand, this borrowed sample from the internet:

import org.apache.spark.sql.functions.{array, coalesce, size}

val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
// replace null arrays with empty arrays before taking the size
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()

does return 0, as expected.

I am looking for the correct approach here, so as to get size = 0.

Upvotes: 0

Views: 1617

Answers (2)

blackbishop

Reputation: 32680

This behavior is inherited from the Java split function, which Scala and Spark use in the same way. Empty input is a special case, and this is well discussed in this SO post.

Spark sets the default value of the split function's second parameter (limit) to -1. As of Spark 3, you can pass a limit parameter to the split function yourself.

You can see this by comparing the Scala split function with the Spark SQL split function:

"".split(",").length
//res31: Int = 1

spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//|                     1|
//+----------------------+

And

",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0

",".split(",", -1).length
//res34: Int = 2

spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//|                      2|
//+-----------------------+
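
For completeness, a sketch of the three-argument split on the DataFrame API, assuming Spark 3.x:

import org.apache.spark.sql.functions.{size, split}

val df = Seq((1, "a,b,c"), (2, ""), (3, ",")).toDF("id", "csv")
df.withColumn("parts", split($"csv", ",", -1)) // explicit limit = -1
  .withColumn("s", size($"parts"))
  .show(false)
// id 2: "" splits to [""], so s = 1
// id 3: "," splits to ["", ""], so s = 2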

Upvotes: 1

mck

Reputation: 42392

I suppose the root cause is that split returns an array containing an empty string, rather than a null or empty array.

scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""

And the size of an array containing an empty string is, of course, 1.

To get around this, perhaps you could do the following:

import org.apache.spark.sql.functions.{length, lit, when}

// if the first element is an empty string, the source array was empty
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
             .withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
                              .otherwise(lit(0)))

df2.show(false)
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |0  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
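
Alternatively, assuming Spark 2.4+ where array_remove is available, you could drop the empty string before taking the size (df2b is just an illustrative name):

import org.apache.spark.sql.functions.{array_remove, size, split, trim}

val df2b = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
              .withColumn("s", size(array_remove($"cola", "")))
df2b.show(false)
// row 2: cola is [""], array_remove drops the "", so s = 0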

Upvotes: 0
