Reputation: 410
I am currently working on a Pyspark application to output daily delta extracts as parquet. Each extract is to be a single partition (the natural partition is on the date the data is created/updated, which is how they are being built).
I was planning to then take the output parquet folder and files, rename the actual parquet part file itself, move it to another location, and discard the original *.parquet directory, including its _SUCCESS and *.crc files.
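For reference, a minimal sketch of the rename-and-move step described above, using only the Python standard library; the function name, paths, and the part-file glob pattern are my own assumptions, so adjust them to your layout:

```python
import glob
import os
import shutil

def promote_single_parquet(spark_output_dir, dest_path):
    """Hypothetical helper: move the single part file out of a Spark
    output directory (e.g. 'daily_extract.parquet/'), giving it a
    proper name, then discard the directory along with its
    _SUCCESS and *.crc files."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.parquet"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    shutil.move(part_files[0], dest_path)  # rename + relocate in one step
    shutil.rmtree(spark_output_dir)        # drops _SUCCESS and *.crc too
    return dest_path
```

Note this only works on a local or mounted filesystem; for HDFS or S3 you would use the corresponding filesystem API instead.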
While I have tested reading files produced this way with both Spark and Pandas, I am unsure whether the approach will cause issues for other applications we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
Upvotes: 0
Views: 2215
Reputation: 31540
If you have a single parquet file and rename it to a new filename, the new file will still be a valid parquet file.

However, if you simply concatenate two or more parquet files into one, the combined file will not be a valid parquet file.
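To illustrate why a plain rename is safe: a parquet file's validity lives in its bytes (it begins and ends with the 4-byte magic `PAR1`, with the footer metadata at the end), not in its filename. A hypothetical sanity-check function, not part of the original answer:

```python
def has_parquet_magic(path):
    """Cheap sanity check: a parquet file starts and ends with the
    4-byte magic 'PAR1'. Renaming the file does not touch these
    bytes, so a renamed single part file still passes.
    (Necessary but not sufficient: it does not validate the footer.)"""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 = os.SEEK_END
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

A naively concatenated pair of parquet files would still have a readable footer only for the second file, which is why tools treat the combined file as invalid.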
If you do need to combine multiple parquet files into one, it is better to create a single file with Spark (using repartition) and write that to the table.
(or)
You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.
Upvotes: 2