Buggy

Reputation: 119

Why can't I merge multiple parquet files using "cat file1.parquet file2.parquet > result.parquet"?

I have created multiple parquet files using pyspark and now I'm trying to merge them all into one. The cat command itself succeeds, but when I read the resulting file I get an error. Has anyone faced this issue before?

Upvotes: 4

Views: 1081

Answers (1)

Uwe L. Korn

Reputation: 8796

You cannot simply concatenate Parquet files using cat, as they are binary files with a "table of contents" in the footer. To merge two files, you have to read them both in and write out a completely new file. This can be done easily with the merge command in parquet-tools.

The technical reason why merging two Parquet files using cat doesn't work is that a Parquet file is useless without its footer. Every Parquet file roughly has the following structure:

RowGroup(nrows=..)
  Column with nrows
  Column with nrows
  ..
RowGroup(nrows=..)
  ..
..
Footer
  Schema (tells you the type of the columns)
  total_nrows
  Location of RowGroups in the file

If you cat two files together, a reader would only see the footer of the last file, because the footer is located from the end of the file. Additionally, if the library used to read the files does some integrity checks, it will realise that your file is broken in some fashion and error out completely.
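To make the "only the last footer is seen" point concrete, here is a small sketch using only the Python standard library. It relies on the fact that a Parquet file ends with the footer metadata, a 4-byte little-endian metadata length, and the magic bytes PAR1. The helpers footer_span and fake_file are invented for this example, and the "files" are placeholder bytes rather than real Parquet data.

```python
import struct

MAGIC = b"PAR1"


def footer_span(data: bytes):
    """Return (start, end) byte offsets of the footer metadata a reader would use."""
    # Layout at the end of a file: <metadata> <4-byte LE metadata length> "PAR1".
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    (length,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - length
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


def fake_file(body: bytes, metadata: bytes) -> bytes:
    # Stand-in for a real file: the metadata is an opaque placeholder, not Thrift.
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


first = fake_file(b"rowgroups-1", b"footer-1")
second = fake_file(b"rowgroups-2", b"footer-2")
combined = first + second  # what cat would produce
start, end = footer_span(combined)
print(combined[start:end])  # b'footer-2'
```

The footer of the first file is still in the byte stream, but nothing points to it. In a real file, the surviving footer also records row group locations as offsets from the start of the original second file, so after concatenation they no longer point at the right bytes.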

Upvotes: 2
