Sanguine

Reputation: 973

PySpark - Save each DataFrame to a single file

I am trying to save a filtered DataFrame back to its source file.

I wrote the code below to load the content of each file in a directory into a separate DataFrame, filter it, and save it back to the same file:

# read every file in the directory as (path, content) pairs
rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
# collect the RDD to a list on the driver
list_elements = rdd.collect()
for element in list_elements:
    path, data = element
    # parse this file's content into its own DataFrame
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    # keep only the rows I want
    df = df.filter('d != 721')
    df.write.save(path, format="json", mode="overwrite")

I was expecting it to overwrite the file with the updated data, but instead it creates a folder named after the file, with the structure and part files shown below:

(screenshot of the output: a folder containing part files)

How can I save each updated DataFrame back to its original source file (.txt)? Thanks in advance.

Upvotes: 0

Views: 674

Answers (1)

matkurek

Reputation: 781

To save it to a single file, call .coalesce(1) or .repartition(1) before .save(). That will still produce the same folder-like structure, but there will be just one JSON part file inside.
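For example (a minimal sketch, reusing the df and path variables from the question):

# collapse the DataFrame to a single partition so Spark
# writes exactly one part file inside the output folder
df.coalesce(1).write.save(path, format="json", mode="overwrite")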

To save it under a "normal" name, after saving you would need to cut the single JSON file from inside the folder, paste it where you want it, and rename it. You can see what that code could look like for CSV files here.
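Here is a sketch of that cut-and-rename step using Python's standard library, assuming the data lives on a local filesystem (on HDFS you would need the Hadoop FileSystem API instead). The folder and file names are placeholders, not values from the question:

import glob
import os
import shutil

def promote_single_part_file(folder, target_file):
    # locate the single part file Spark wrote inside the output folder
    part_file = glob.glob(os.path.join(folder, "part-*"))[0]
    # replace any existing file at the target path
    if os.path.exists(target_file):
        os.remove(target_file)
    # move the part file out under the desired name
    shutil.move(part_file, target_file)
    # remove the leftover folder (_SUCCESS marker etc.)
    shutil.rmtree(folder)

# write to a temporary folder first, since the final name is still a plain file
tmp_folder = "/content/sample_data/test_data/file1_tmp"
df.coalesce(1).write.save(tmp_folder, format="json", mode="overwrite")
promote_single_part_file(tmp_folder, "/content/sample_data/test_data/file1.txt")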

Upvotes: 0
