Reputation: 63042
Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
However it will be a long time before spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA:
https://issues.apache.org/jira/browse/SPARK-10943
So what are folks doing with regard to null column values today when writing out dataframes to parquet? I can only think of very ugly, horrible hacks like writing empty strings - and, well, I have no idea what to do with numerical values to indicate null, short of putting some sentinel value in and having my code check for it (which is inconvenient and bug prone).
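For reference, a minimal PySpark sketch of how the problem shows up for me (the column name and output path are just placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# lit(None) without a cast produces a NullType column
df = spark.range(3).withColumn("comments", lit(None))

# Writing it fails with an AnalysisException because the Parquet data
# source cannot handle the null/void data type (exact wording varies
# by Spark version)
df.write.parquet("/tmp/null_repro")  # hypothetical output path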
Upvotes: 34
Views: 77722
Reputation: 434
I wrote a PySpark solution for this (df is a dataframe with columns of NullType):
# get dataframe schema
my_schema = list(df.schema)
null_cols = []

# iterate over schema list to filter for NullType columns
for st in my_schema:
    if str(st.dataType) == 'NullType':
        null_cols.append(st)

# cast null type columns to string (or whatever you'd like)
for ncol in null_cols:
    mycolname = str(ncol.name)
    df = df \
        .withColumn(mycolname, df[mycolname].cast('string'))
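For what it's worth, the same idea can be written more compactly with a single select (an untested sketch; swap 'string' for whatever target type you prefer):

from pyspark.sql.types import NullType

df = df.select([
    df[f.name].cast('string').alias(f.name) if isinstance(f.dataType, NullType)
    else df[f.name]
    for f in df.schema
])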
Upvotes: 15
Reputation: 35229
You misinterpreted SPARK-10943. Spark does support writing null values to numeric columns. The problem is that null alone carries no type information at all:
scala> spark.sql("SELECT null as comments").printSchema
root
|-- comments: null (nullable = true)
As per the comment by Michael Armbrust, all you have to do is cast:
scala> spark.sql("""SELECT CAST(null as DOUBLE) AS comments""").printSchema
root
|-- comments: double (nullable = true)
and the result can be safely written to Parquet.
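The same fix expressed through the PySpark DataFrame API (a sketch; the output path is a placeholder):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# A typed null literal instead of a bare null
df = spark.range(3).withColumn("comments", lit(None).cast("double"))
df.printSchema()
# root
#  |-- id: long (nullable = false)
#  |-- comments: double (nullable = true)

df.write.mode("overwrite").parquet("/tmp/typed_nulls")  # hypothetical path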
Upvotes: 41