Spark: create new column from another but not nullable

Question

I have a simple problem, but I can't find an easy solution.

I noticed the following:

myDF.withColumn("newColumn", col("aNullableColumn"))

In the resulting schema, newColumn is nullable, even though aNullableColumn contains no null values.

How can I make newColumn non-nullable?

I googled a bit, and the only solution I found is to rewrite the schema and recreate the DataFrame, which isn't a nice solution.

David Vrba · Accepted Answer

If you are absolutely sure that your column has no null values, you can do this to change the nullability property of your new column:

from pyspark.sql.functions import col, lit, coalesce

myDF.withColumn("newColumn", coalesce(col("aNullableColumn"), lit(0)))

Make sure to use the correct data type inside the lit function (the same data type as aNullableColumn). Also be aware that if the column does contain a null value, coalesce will replace it with the value you provide inside lit.
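To make the caveat about replaced nulls concrete, here is a minimal pure-Python sketch of what coalesce does to each row: it returns the first argument that is not null. The helper name coalesce_row and the sample values are hypothetical, and this only models the behaviour; it is not Spark's implementation.

```python
def coalesce_row(*values):
    """Return the first value that is not None, or None if every value is None."""
    for value in values:
        if value is not None:
            return value
    return None


# Hypothetical column contents; None stands in for a SQL NULL.
a_nullable_column = [1, None, 3]

# Row-by-row analogue of coalesce(col("aNullableColumn"), lit(0)).
new_column = [coalesce_row(value, 0) for value in a_nullable_column]
print(new_column)  # [1, 0, 3]
```

The null in the second row silently becomes 0, so if a replaced value would be indistinguishable from real data, it may be worth checking for nulls before relying on this trick.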

This works because of how coalesce determines its nullable property. The following is taken directly from the Spark source code:

Coalesce is nullable if all of its children are nullable, or if it has no children.

Here the second child is lit(0), which is not nullable, so the resulting column is not nullable either.
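The quoted rule can be sketched in plain Python to show why one non-nullable child is enough. The Expr class and coalesce_nullable function below are hypothetical stand-ins for illustration, not Spark classes; only the rule itself comes from the quoted comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for an expression that carries a nullable flag."""
    name: str
    nullable: bool


def coalesce_nullable(children):
    # Nullable only if every child is nullable. all() over an empty
    # sequence is True, which covers the "no children" case in the rule.
    return all(child.nullable for child in children)


column = Expr("aNullableColumn", nullable=True)
literal = Expr("lit(0)", nullable=False)

print(coalesce_nullable([column, literal]))  # False: the literal can never be null
print(coalesce_nullable([column, column]))   # True: every child may be null
print(coalesce_nullable([]))                 # True: no children
```

On a real DataFrame, the same outcome should be visible through printSchema() or the nullable attribute of the column's field in the schema.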
