Reputation: 648
I am trying to overwrite a Parquet file in S3 with Pyspark. Versioning is enabled for the bucket.
I am using the following code:
Write v1:
df_v1.repartition(1).write.parquet(path='s3a://bucket/file1.parquet')
Update v2:
df_v1 = spark.read.parquet("s3a://bucket/file1.parquet")
df_v2 = df_v1....  # <- transform
df_v2.repartition(1).write.mode("overwrite").parquet('s3a://bucket/file1.parquet')
But when I read df_v2, it contains data from both writes. Furthermore, after writing df_v1 I can see one part-xxx.snappy.parquet file, and after writing df_v2 I can see two. It behaves like an append rather than an overwrite.
What am I missing? Thanks
Spark = 2.4.4, Hadoop = 2.7.3
Upvotes: 3
Views: 5431
Reputation: 15283
The problem probably comes from the fact that you are using S3.
In S3, the file system is key/value based, which means there is no physical folder named file1.parquet; there are only objects whose keys look something like s3a://bucket/file1.parquet/part-XXXXX-b1e8fd43-ff42-46b4-a74c-9186713c26c6-c000.parquet (that is just an example).
So when you "overwrite", Spark is supposed to overwrite the folder, but since the folder cannot be detected, it just creates new keys: the result behaves like an "append" mode.
You probably need to write your own function that overwrites the "folder", i.e. deletes all the keys that contain the folder name in their prefix.
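As a rough illustration, here is a minimal sketch using boto3 (the helper name overwrite_prefix and the bucket/prefix values are placeholders, not something from the question); it deletes every object under the logical "folder" before rewriting:

import boto3

def overwrite_prefix(bucket_name, prefix):
    """Delete every object whose key starts with `prefix` (the logical 'folder')."""
    s3 = boto3.resource("s3")
    # objects.filter(Prefix=...).delete() issues batched DeleteObjects requests
    s3.Bucket(bucket_name).objects.filter(Prefix=prefix).delete()

# Hypothetical usage, matching the paths in the question:
overwrite_prefix("bucket", "file1.parquet/")
df_v2.repartition(1).write.parquet("s3a://bucket/file1.parquet")

Note that with versioning enabled on the bucket, these deletes only add delete markers; the old object versions stay around until they are removed explicitly or by a lifecycle rule.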
Upvotes: 3