Reputation: 687
CSVFileFormat seems to read and write empty values as null for string columns. I have searched around but have been unable to find clear information about this, so I put together a simple test:
val df = session.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),
  (4, null)
))

df.coalesce(1).write.mode("overwrite").format("csv")
  .option("delimiter", ",")
  .option("nullValue", "unknown")
  .option("treatEmptyValuesAsNulls", "false")
  .save(s"$path/test")
This outputs:
0,a
1,b
2,c
3,unknown
4,unknown
So it appears to be treating both empty strings and null values as null. The same thing happens when reading a CSV file with empty quoted strings and nulls. Is there currently any way to treat these differently?
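One possible workaround, sketched below, is to substitute a sentinel for empty strings before writing, so they survive the round trip distinctly from nulls; the "__EMPTY__" token is arbitrary, not any Spark convention:

import org.apache.spark.sql.functions.{col, lit, when}

// Swap empty strings for an arbitrary sentinel before writing, so they
// can be told apart from nulls in the CSV output. Nulls are untouched
// (the when() condition evaluates to null for them) and still become
// "unknown" via the nullValue option.
val sentinel = "__EMPTY__"
val escaped = df.withColumn("_2",
  when(col("_2") === "", lit(sentinel)).otherwise(col("_2")))

escaped.coalesce(1).write.mode("overwrite").format("csv")
  .option("nullValue", "unknown")
  .save(s"$path/test")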
Upvotes: 18
Views: 27065
Reputation: 341
I agree this behavior is a great improvement, but BE WARNED if you use Sqoop in production to export CSV files created with Spark: if you upgrade without changing .option("nullValue", "null"), then your Sqoop export will fail, and it will likely throw an error that provides no insight whatsoever into the root cause. This is 100% a Sqoop problem, and if at all possible I urge everyone to take advantage of Spark's JDBC support for exports instead, but be warned and aware. Lots of enterprises out there still rely heavily on Sqoop.
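For anyone weighing the JDBC route, a minimal sketch of a direct JDBC export looks like this; the URL, table name, driver, and credentials are all placeholders:

import java.util.Properties

// Placeholder connection details; adjust for your own database.
val props = new Properties()
props.setProperty("user", "etl_user")
props.setProperty("password", "changeme")
props.setProperty("driver", "com.mysql.cj.jdbc.Driver")

// Writes the DataFrame straight to the target table over JDBC,
// sidestepping the CSV null/empty-string round trip entirely.
df.write
  .mode("append")
  .jdbc("jdbc:mysql://db-host:3306/warehouse", "target_table", props)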
Upvotes: 0
Reputation: 2866
A mere two and a half years later, empty strings are no longer considered equal to null values, thanks to Spark 2.4.0! See this commit for some detail on the functionality. Your code will behave as expected under 2.4.0+:
val df = session.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),
  (4, null)
))

df.coalesce(1).write.mode("overwrite").format("csv")
  .option("delimiter", ",")
  .option("nullValue", "unknown")
  .option("treatEmptyValuesAsNulls", "false")
  .save(s"$path/test")
Results in:
0,a
1,b
2,c
3,
4,unknown
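If you want explicit control over both representations, Spark 2.4 also added an emptyValue option alongside nullValue, so the two cases can be given distinct tokens on write; the token below is arbitrary:

df.coalesce(1).write.mode("overwrite").format("csv")
  .option("nullValue", "unknown")
  .option("emptyValue", "EMPTY") // token choice is arbitrary
  .save(s"$path/test")

With those options, the empty string in row 3 should come out as EMPTY and the null in row 4 as unknown.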
Upvotes: 10