Phil

Reputation: 648

Overwrite a Parquet file with Pyspark

I am trying to overwrite a Parquet file in S3 with Pyspark. Versioning is enabled for the bucket.

I am using the following code:

Write v1:

df_v1.repartition(1).write.parquet(path='s3a://bucket/file1.parquet')

Update v2:

df_v1 = spark.read.parquet("s3a://bucket/file1.parquet")
df_v2 = df_v1.... <- transform
df_v2.repartition(1).write.mode("overwrite").parquet('s3a://bucket/file1.parquet')

But when I read the file back after the second write, it contains data from both writes. Furthermore, after writing df_v1 I can see one part-xxx.snappy.parquet file; after writing df_v2 I can see two. It behaves like an append rather than an overwrite.

What am I missing? Thanks

Spark 2.4.4, Hadoop 2.7.3

Upvotes: 3

Views: 5431

Answers (1)

Steven

Reputation: 15283

The problem probably comes from the fact that you are using S3. In S3, the file system is key/value based, which means there is no physical folder named file1.parquet; there are only objects whose keys look like s3a://bucket/file1.parquet/part-XXXXX-b1e8fd43-ff42-46b4-a74c-9186713c26c6-c000.parquet (that's just an example).

So when you "overwrite", Spark is supposed to replace the folder, but since no such folder can be detected, it simply creates new keys. The result looks like an "append".
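A quick way to confirm this is to list the keys under the path, for example with boto3 (just a sketch; boto3 is my assumption here, the bucket and prefix are the ones from your question):

import boto3

# list every object stored under the "folder" prefix
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket", Prefix="file1.parquet/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
# after the second write, two part-*.snappy.parquet keys show up here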

You probably need to write your own function that overwrites the "folder", i.e. deletes every key that has the folder as a prefix of its name.
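A minimal sketch of such a cleanup with boto3 (again an assumption on my side; the bucket name and prefix are the ones from your question):

import boto3

def delete_prefix(bucket: str, prefix: str) -> None:
    """Delete every S3 object whose key starts with `prefix` (the Parquet "folder")."""
    s3 = boto3.resource("s3")
    s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()

# wipe the old keys, then write the new version
delete_prefix("bucket", "file1.parquet/")

Once the old keys are gone, writing df_v2 to s3a://bucket/file1.parquet leaves only the new part files. Note that in your code df_v2 is derived lazily from the same path, so you would need to materialize it (for example write it to a temporary location first) before deleting the old keys.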

Upvotes: 3
