Shinagan

Reputation: 445

"Parquet record is malformed" while column count is not 0

On an AWS EMR cluster, I'm trying to write a query result to parquet using Pyspark but face the following error:

Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely instead
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
    at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    ... 10 more

I've read that this can happen when a column contains only null values, but after checking the counts for every column, that is not the case: none of the columns is completely empty. When I write the same result to a text file instead of Parquet, everything works fine.
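The per-column check was roughly the following (a sketch; df stands in for the query result):

from pyspark.sql import functions as F

# Count the non-null values in every column; an all-null column would show a count of 0
df.select([F.count(F.col(c)).alias(c) for c in df.columns]).show()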

Any clue what could trigger this error? For reference, here are all the data types used in this table (51 columns in total):

'array<bigint>',
'array<char(50)>',
'array<smallint>',
'array<string>',
'array<varchar(100)>',
'array<varchar(50)>',
'bigint',
'char(16)',
'char(20)',
'char(4)',
'int',
'string',
'timestamp',
'varchar(255)',
'varchar(50)',
'varchar(87)'

Upvotes: 7

Views: 9929

Answers (4)

Tuyen Luong

Reputation: 1366

Using Spark 3 solves this problem.

Upvotes: 0

HagaiA

Reputation: 373

As Shinagan wrote, you can check whether an array is empty and, if so, set it to NULL.

You can do this with the cardinality function:

case when cardinality(array_x) = 0 then null else array_x end
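For instance, a sketch of how that expression might be applied before writing (table, column, and path names are placeholders):

# array_x is an array column that may contain empty arrays
fixed = spark.sql("""
    SELECT
        id,
        CASE WHEN cardinality(array_x) = 0 THEN NULL ELSE array_x END AS array_x
    FROM source_table
""")
fixed.write.parquet("s3://bucket/path/")  # placeholder output location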

Upvotes: 0

It looks like you are using one of Spark's Hive write paths (org.apache.hadoop.hive.ql.io.parquet.write). I was able to work around this issue by writing directly to Parquet instead, and then adding the resulting location as a partition to whatever Hive table needs it.

# Write with Spark's native Parquet writer instead of the Hive write path
df.write.parquet(your_path)

# Then register the written location as a partition of the Hive table
spark.sql(f"""
    ALTER TABLE {your_table}
    ADD PARTITION (partition_spec) LOCATION '{your_path}'
    """)

Upvotes: 0

Shinagan

Reputation: 445

It turns out the Parquet writer used here does not accept empty arrays. This error is triggered if one or more array columns (of any type) contain an empty array anywhere in the table.

One workaround is to cast the empty arrays to NULL values.
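A minimal PySpark sketch of that cast, assuming df is the DataFrame being written and the array column names below are placeholders:

from pyspark.sql import functions as F

# Replace empty arrays with NULL in every affected array column
array_cols = ["array_a", "array_b"]  # placeholder names
for c in array_cols:
    df = df.withColumn(
        c,
        F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c)),
    )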

Upvotes: 19
