Overwrite a Parquet file with Pyspark

Question

I am trying to overwrite a Parquet file in S3 with Pyspark. Versioning is enabled for the bucket.

I am using the following code:

Write v1:

df_v1.repartition(1).write.parquet(path='s3a://bucket/file1.parquet')

Update v2:

df_v1 = spark.read.parquet("s3a://bucket/file1.parquet")
df_v2 = df_v1.... <- transform
df_v2.repartition(1).write.mode("overwrite").parquet('s3a://bucket/file1.parquet')

But when I read df_v2 it contains data from both writes. Furthermore when df_v1 is written I can see one part-xxx.snappy.parquet file, after writing df_v2 I can see two. It behaves as an append rather than overwrite.

What am I missing ? Thanks

Spark = 2.4.4 Hadoop = 2.7.3

Overwrite a Parquet file with Pyspark

Answers (1)

Related Questions