Kyro
Kyro

Reputation: 687

Read and write empty string "" vs NULL in Spark 2.0.1

CSVFileFormat seems to read and write empty values as null for string columns. I have searched around but have been unable to find clear information about this, so I put together a simple test.

val df = session.createDataFrame(Seq(
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, ""),
    (4, null)
))  

df.coalesce(1).write.mode("overwrite").format("csv") 
    .option("delimiter", ",")
    .option("nullValue", "unknown")
    .option("treatEmptyValuesAsNulls", "false")
    .save(s"$path/test")

This outputs:

0,a
1,b
2,c
3,unknown
4,unknown

So, it appears to be treating both empty strings and null values as null. The same thing happens when reading a CSV file with empty quoted strings and nulls. Is there currently any way to treat these differently?

Upvotes: 18

Views: 27065

Answers (2)

Gabe Church
Gabe Church

Reputation: 341

I agree this behavior is a great improvement but BE WARNED if you use Sqoop to export CSV files in production that are created with Spark and you update without changing

.option("nullValue", "null")

then your Sqoop Export will fail, and it will likely throw you an error that provides utterly no insight into the root cause of this. 100% a Sqoop problem, and if at all possible I highly urge everyone to take advantage of Spark JDBC support for exports, but be warned and aware. Lots of enterprises out there still rely on Sqoop heavily.

Upvotes: 0

bsplosion
bsplosion

Reputation: 2866

A mere two and a half years later, empty strings are no longer considered equal to null values thanks to Spark 2.4.0! See this commit for a bit of detail on functionality. Your code will behave as expected under 2.4.0+:

val df = session.createDataFrame(Seq(
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, ""),
    (4, null)
))  

df.coalesce(1).write.mode("overwrite").format("csv") 
    .option("delimiter", ",")
    .option("nullValue", "unknown")
    .option("treatEmptyValuesAsNulls", "false")
    .save(s"$path/test")

Results in:

0,a
1,b
2,c
3,
4,unknown

Upvotes: 10

Related Questions