Reputation: 2456
I have a requirement to delete a folder in HDFS containing a large number of files, say 1,000,000. And this is not a one-time task; it is a daily requirement. Currently I am using the code below:
// requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.{FileSystem, Path}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.delete(folder, true); // true = recursive
But the above takes a long time, approximately 3 hours. Is there any way to delete the entire folder faster?
Upvotes: 2
Views: 2525
Reputation: 6172
Simple answer: you can't.
Let me explain why. When you are deleting a folder, you are removing all references to all files (recursively) contained in it. The metadata about these files (chunk locations) is retained in the namenode.
The datanodes store data chunks, but have basically no idea which actual files they correspond to. Although you could technically remove all references to a folder from the namenode (which would make the folder appear deleted), the data would still remain on the datanodes, which would have no way of knowing that the data is "dead".
As such, when you delete a folder, you first have to reclaim the storage from all the data chunks of all its files, spread across the whole cluster. This can take a significant amount of time, but it is basically unavoidable.
You could simply process deletions in a background thread. This won't shorten the lengthy operation itself, but it would at least hide it from the application.
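A minimal sketch of that background-thread idea, using a single-threaded `ExecutorService`. To keep it self-contained and runnable it deletes a local directory with `java.nio.file` as a stand-in for the Hadoop call; in your job you would replace `deleteRecursively(folder)` with `fs.delete(folder, true)`. The class name, the temp folder, and the file names are all hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundDelete {

    // Recursive delete of a local tree; stand-in for fs.delete(folder, true).
    static void deleteRecursively(Path root) throws IOException {
        try (var paths = Files.walk(root)) {
            // Reverse order so children are deleted before their parent directories.
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService deleter = Executors.newSingleThreadExecutor();

        // Hypothetical folder standing in for the HDFS path to remove.
        Path folder = Files.createTempDirectory("to-delete");
        Files.createFile(folder.resolve("part-00000"));

        // Hand the slow delete off to the background thread;
        // the submitting thread returns immediately and can keep working.
        Future<?> pending = deleter.submit(() -> {
            try {
                deleteRecursively(folder); // in Hadoop: fs.delete(folder, true)
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });

        pending.get(); // the demo waits only so it can verify completion
        deleter.shutdown();
        System.out.println("deleted: " + !Files.exists(folder));
    }
}
```

Note that `main` waits on the `Future` only so the demo can check the result; the whole point in your job is that you would *not* wait, letting the daily deletion run while the rest of the application proceeds.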
Upvotes: 2