Ash

Reputation: 410

Are there any problems with saving Parquet as a single file with no directory?

I am currently working on a PySpark application that outputs daily delta extracts as Parquet. Each extract is to be a single partition (the natural partition is the date the data was created/updated, which is how the extracts are being built).

I was then planning to take the output Parquet folder and its files, rename the actual Parquet file itself, move it to another location, and discard the original *.parquet directory, including its _SUCCESS and *.crc files.

While I have tested reading files produced using the above scenario with Spark and Pandas, I am unsure whether this will cause issues with other applications that we may introduce in the future.

Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?

Thanks

Upvotes: 0

Views: 2215

Answers (1)

notNull

Reputation: 31540

If you have a single Parquet file and simply rename it, the renamed file is still a valid Parquet file.

If you take two or more Parquet files and combine them into one yourself (for example by concatenating their bytes), the combined file will not be a valid Parquet file.

  • If you need to combine several Parquet files into one, it is better to have Spark produce a single file (using repartition) and write that to the table.

    (or)

  • You can also use parquet-tools-**.jar to merge multiple Parquet files into one Parquet file.
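
The rename-and-move step described in the question could look roughly like the sketch below. It assumes the output directory is on a local filesystem and was written with a single partition (for example via `df.coalesce(1).write.parquet(...)`), so it contains exactly one part file. The function name `promote_single_part_file` and its arguments are hypothetical, and HDFS or object storage would need the matching filesystem API instead of `shutil`.

```python
import shutil
from pathlib import Path


def promote_single_part_file(spark_output_dir, target_file):
    """Move the only part-*.parquet file out of a Spark output directory.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    src_dir = Path(spark_output_dir)
    target = Path(target_file)

    # Spark names its data files part-*; _SUCCESS and the hidden .crc files
    # are sidecars and do not match this pattern.
    part_files = sorted(src_dir.glob("part-*.parquet"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {src_dir}, found {len(part_files)}"
        )

    # Renaming/moving does not touch the file's bytes, so it stays valid Parquet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(part_files[0]), str(target))

    # Discard the original directory together with _SUCCESS and the .crc files.
    shutil.rmtree(src_dir)
    return target
```

Failing when more than one part file is present is deliberate here: it avoids silently keeping one file and throwing the rest of the data away.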

Upvotes: 2
