yoel
yoel

Reputation: 249

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

If I have a DataFrame called df that looks like:

+----+----+
|  a1+  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+

I can selectively replace values like so:

val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))

so that df2 looks like:

+----+----+
|  a1+  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+

but why can't I check if it's null, like:

val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))

so that I get:

+----+----+
|  a1+  a2|
+----+----+
| foo| bar|
| baz| baz|
| etc| etc|
+----+----+

Edit: $"a1".isNull doesn't seem to work. Could it be because of how I'm constructing the dataframe I'm using to test, which is as follows?

val schema = StructType(
                StructField("a1", StringType, false) ::
                StructField("a2", StringType, false) :: Nil
)

val data = sc.parallelize(Array(
                Row("foo","bar"),
                Row("N/A","baz"),
                Row(null,"etc"))
)

val df = sqlContext.createDataFrame(data, schema)

I also can't use coalesce, as far as I know, because sometimes I need to use a static value instead of another column's value.

Edit again: Setting my test columns to nullable = false doesn't help.

Upvotes: 4

Views: 11846

Answers (1)

T. Gawęda
T. Gawęda

Reputation: 16086

Because null means no value and shouldn't be checked like this.

Use isNull function:

val df3 = df2.withColumn("a1", when($"a1".isNull, $"a2"))

or coalesce, which returns first non-null value:

val df3 = df2.withColumn("a1", coalesce($"a1", $"a2"))

Upvotes: 8

Related Questions