membersound

Reputation: 86855

Why does Files.deleteIfExists take so long for large files?

On a large file (here 35GB):

Files.deleteIfExists(Paths.get("large.csv"));

Deleting the file from Java takes more than 60 s. Deleting it with rm large.csv on the console takes just a moment.

Why? Can I speed up large file deletion from within Java?

Upvotes: 0

Views: 1829

Answers (1)

Stephen C

Reputation: 719299

I would blame this on the operating system. On both Windows and Linux, Java simply calls a function in the native C runtime library that the OS provides to delete the file.

(Check the OpenJDK source code.)


So why might it take a long time for the operating system to delete a large file?

  • A typical file system keeps a map of the disk blocks that are free versus in-use. If you are freeing a really large file, a large number of blocks are being freed, so a large number of bits in the free map need to be updated and written to disk.

  • A typical file system uses a tree-based index structure to map file offsets to disk blocks. If a file is large enough, the index structure may span multiple disk blocks. When a file is deleted, the entire index needs to be scanned to find all of the blocks containing data that need to be freed.

  • These costs are magnified if the file is badly fragmented, and the index blocks and free map blocks are widely scattered.

  • Deleting a file is typically done synchronously. At least, all of the disk blocks are marked as free before the syscall returns. (If you don't do that, the user is liable to complain that deleting files doesn't work.)

In short, when you delete a huge file, there is a lot of "disk" I/O to do. The operating system does this, not Java.


So why would deleting a file be faster from the command line?

One possible reason is that the rm command you are using is actually just moving the deleted file to a Trash folder. That is a rename operation, and it is much faster than a real delete.

Note: that's not the normal behavior of rm on Linux.
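
If the real problem is that the calling thread is blocked for a minute, the same rename trick can be imitated from Java. The following is only a sketch under stated assumptions: the class name DeferredDelete, the method renameThenDelete and the ".deleting" suffix are hypothetical, and the file is assumed to stay on the same file system so that the move is a rename rather than a copy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.CompletableFuture;

public class DeferredDelete {

    // Renames the file so that it disappears from its original path right away,
    // then performs the real (possibly slow) delete on a background thread.
    static CompletableFuture<Void> renameThenDelete(Path file) throws IOException {
        // Hypothetical naming scheme: a sibling in the same directory, so the
        // move stays on one file system and is a rename rather than a copy.
        Path doomed = file.resolveSibling(file.getFileName() + ".deleting");
        Files.move(file, doomed, StandardCopyOption.ATOMIC_MOVE);

        return CompletableFuture.runAsync(() -> {
            try {
                Files.deleteIfExists(doomed);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

This does not reduce the amount of I/O the operating system has to do. It only moves the wait off the calling thread. The disk space is not released until the background delete finishes, and since runAsync uses daemon threads by default, the renamed file can be left behind if the JVM exits first.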

Another possible reason (on Linux) is that the index and free map blocks for the file that you were deleting were in the buffer cache in one test scenario and not in the other. (If your machine has lots of spare RAM, the Linux OS will cache disk blocks in RAM to improve performance. It is pretty effective.)
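
To check which of these explanations applies, it helps to time the Java call on its own, separately from JVM startup, and compare that with time rm on a file of the same size in the same cache state. A minimal sketch, assuming a hypothetical class DeleteTimer and a path passed as the first command-line argument:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeleteTimer {

    // Deletes the file if it exists and returns the elapsed wall-clock
    // time of the deleteIfExists call alone, in milliseconds.
    static long timeDeleteMillis(Path path) throws IOException {
        long start = System.nanoTime();
        Files.deleteIfExists(path);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DeleteTimer <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        System.out.println("deleteIfExists took " + timeDeleteMillis(path) + " ms");
    }
}
```

If the number printed here is close to what rm needs for a comparable file, the difference seen earlier most likely came from the test setup (caching, or what rm actually does) and not from Java.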

Upvotes: 6
