Reputation: 119
I have created multiple parquet files using pyspark and now I'm trying to merge all the parquet files to 1. I'm able to merge the files, but while reading in the resulting file, I'm getting an error. Have anyone faced this issue before?
Upvotes: 4
Views: 1081
Reputation: 8796
You cannot simply concatenate Parquet files using cat
as they are binary files with a "table of contents" in the footer. To merge two files, you would have to read them both in and write out a completely new file. This could be done easily using the merge
command in the parquet-tools
.
The technical background that merging two Parquet files using cat
isn't working comes down to the fact that a Parquet file is useless without a footer. Every Parquet file is made up roughly by the following structure:
RowGroup(nrows=..)
Column with nrows
Column with nrows
..
RowGroup(nrows=..)
..
..
Footer
Schema (tells you the type of the columns)
total_nrows
Location of RowGroups in the file
If you cat
two files together, you would only see the last footer of the two files. Additionally, if the library used to read the files does some integrity checks, it will realise that your file is broken in some fashion and completely error out.
Upvotes: 2