Reputation: 973
I am trying to save a filtered DataFrame back to the same source file.
I wrote the code below to transform the content of each file in a directory into a separate DataFrame, filter it, and save it back to the same file:
rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
# collect the (path, content) pairs of the RDD to a list on the driver
list_elements = rdd.collect()
for element in list_elements:
    path, data = element
    # parse the file's content as a one-element JSON dataset
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    df.write.save(path, format="json", mode="overwrite")
I was expecting it to overwrite the file with the updated data, but instead it creates a folder with the file's name, containing the structure and part files below:
How can I save each updated DataFrame back to the same source file (.txt)? Thanks in advance.
Upvotes: 0
Views: 674
Reputation: 781
To save it to one file, use .coalesce(1) or .repartition(1) before .save(); that will still produce the same folder-like structure, but there will be a single JSON file inside.
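For example, the write inside your loop could become (a minimal sketch, reusing the df and path from the question):

# collapse to a single partition so Spark writes exactly one part file
df.coalesce(1).write.save(path, format="json", mode="overwrite")

coalesce(1) avoids the full shuffle that repartition(1) triggers, which is usually the cheaper option for small per-file DataFrames like these.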
To save it under a "normal" name, after saving you would need to cut the single JSON file out of that folder, paste it where you want it, and rename it to the desired name. You can see code showing how this could look for CSV files here.
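A minimal sketch of that move-and-rename step for local files (assumptions: the paths returned by wholeTextFiles carry a file: prefix that shutil cannot use, the default filesystem is local as in Colab, and the uncompressed part file ends in .json; tmp_dir is a name chosen here for illustration):

import glob
import shutil

local_path = path.replace("file:", "")   # shutil needs a plain local path
tmp_dir = local_path + "_tmp"            # temporary output folder

# write the single-partition output to the temporary folder
df.coalesce(1).write.save(tmp_dir, format="json", mode="overwrite")

# locate the one part file, move it over the original file, drop the folder
part_file = glob.glob(tmp_dir + "/part-*.json")[0]
shutil.move(part_file, local_path)
shutil.rmtree(tmp_dir)

On a cluster where the default filesystem is HDFS, you would do the move with the Hadoop FileSystem API instead of shutil.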
Upvotes: 0