Reputation: 445
On an AWS EMR cluster, I'm trying to write a query result to Parquet using PySpark, but I get the following error:
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
... 10 more
I've read that this can happen when some columns contain only null values, but after checking the counts for every column, that is not the case: none of the columns is completely empty. When I write the same results to a text file instead of Parquet, everything works fine.
Any clue what could trigger this error? Here are all the data types used in this table (51 columns in total):
'array<bigint>',
'array<char(50)>',
'array<smallint>',
'array<string>',
'array<varchar(100)>',
'array<varchar(50)>',
'bigint',
'char(16)',
'char(20)',
'char(4)',
'int',
'string',
'timestamp',
'varchar(255)',
'varchar(50)',
'varchar(87)'
Upvotes: 7
Views: 9929
Reputation: 373
As Shinagan wrote, you can check whether the array is empty and set it to NULL. You can do this with the cardinality function:
case when cardinality(array_x) = 0 then null else array_x end
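If you are applying this from PySpark rather than plain SQL, here is a minimal sketch, assuming the DataFrame is df and array_x is a hypothetical array column (cardinality is available in Spark SQL 2.4+):

from pyspark.sql import functions as F

# Replace empty arrays with NULL using the same CASE expression
df = df.withColumn(
    "array_x",
    F.expr("case when cardinality(array_x) = 0 then null else array_x end"),
)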
Upvotes: 0
Reputation: 818
It looks like you are using one of Spark's Hive write paths (org.apache.hadoop.hive.ql.io.parquet.write). I was able to work around this issue by writing directly to Parquet with Spark's native writer and then adding the resulting partitions to whatever Hive table needs them:
# Write directly with Spark's native Parquet writer instead of the Hive write path
df.write.parquet(your_path)

# Register the newly written files as a partition of the Hive table
spark.sql(f"""
    ALTER TABLE {your_table}
    ADD PARTITION (partition_spec) LOCATION '{your_path}'
""")
Upvotes: 0
Reputation: 445
It turns out that Parquet does not support empty arrays. This error is triggered if the table contains one or more empty arrays of any type.
One workaround is to cast the empty arrays to NULL values.
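A minimal PySpark sketch of that workaround, assuming df is the query result; size() and when() are standard Spark SQL functions, and looping over the array columns is just one way to apply the change:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

# Find all array-typed columns in the DataFrame
array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]

# Turn every empty array into NULL so the Parquet writer accepts the rows
for c in array_cols:
    df = df.withColumn(
        c,
        F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c)),
    )

With the empty arrays converted to NULL, the same write should no longer hit the malformed-record error.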
Upvotes: 19