Reputation: 410
I am currently working on a Pyspark application to output daily delta extracts as parquet. Each extract is to be a single partition (the natural partition is on the date the data is created/updated, which is how they are being built).
I was planning to then take the output parquet folder and files, rename the actual parquet part file itself, move it to another location, and discard the original *.parquet directory, including its _SUCCESS and *.crc files.
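For reference, a minimal sketch of the rename-and-move step described above, using only the Python standard library; the function name, paths, and the part-file glob pattern are my own assumptions, so adjust them to your layout:

```python
import glob
import os
import shutil

def promote_single_parquet(spark_output_dir, dest_path):
    """Hypothetical helper: move the single part file out of a Spark
    output directory (e.g. 'daily_extract.parquet/'), giving it a
    proper name, then discard the directory along with its
    _SUCCESS and *.crc files."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.parquet"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    shutil.move(part_files[0], dest_path)  # rename + relocate in one step
    shutil.rmtree(spark_output_dir)        # drops _SUCCESS and *.crc too
    return dest_path
```

Note this only works on a local or mounted filesystem; for HDFS or S3 you would use the corresponding filesystem API instead.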
While I have tested reading files produced this way with both Spark and Pandas, I am unsure whether the approach will cause issues for other applications we may introduce in the future.
Can anyone see any actual issue (apart from the processing/coding effort) with the above approach?
Thanks
Upvotes: 0
Views: 2215
Reputation: 31540
If you have a single parquet file and rename it to a new filename, the new file will still be a valid parquet file.

However, if you simply concatenate two or more parquet files into one, the combined file will not be a valid parquet file.
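To illustrate why a plain rename is safe: a parquet file's validity lives in its bytes (it begins and ends with the 4-byte magic `PAR1`, with the footer metadata at the end), not in its filename. A hypothetical sanity-check function, not part of the original answer:

```python
def has_parquet_magic(path):
    """Cheap sanity check: a parquet file starts and ends with the
    4-byte magic 'PAR1'. Renaming the file does not touch these
    bytes, so a renamed single part file still passes.
    (Necessary but not sufficient: it does not validate the footer.)"""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 = os.SEEK_END
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

A naively concatenated pair of parquet files would still have a readable footer only for the second file, which is why tools treat the combined file as invalid.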
If you do need to combine multiple parquet files into one, it is better to create a single file with Spark (using repartition) and write that to the table.
(or)
You can also use parquet-tools-**.jar to merge multiple parquet files into one parquet file.
Upvotes: 2