Reputation: 74
My application needs to list all files on an NFS drive that match a regex pattern. I am using the Apache commons-io FileUtils to list the files.
This works, but the returned collection is huge (there are millions of files). Heap memory consumption goes up to 10 GiB and causes an OOM on many occasions.
Is there a way to limit the size of the returned collection? (Like getting only the first N results, the way SELECT queries can be limited in SQL.)
I am fine with getting a partial result; I can call the list method in a loop. I will be deleting files older than a certain date, so the total result set will keep shrinking after every iteration.
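For reference, this is roughly the call I'm making (the mount point and regex here are placeholders for my actual setup):

```java
import java.io.File;
import java.util.Collection;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.RegexFileFilter;
import org.apache.commons.io.filefilter.TrueFileFilter;

public class ListAllMatches {
    public static void main(String[] args) {
        // Materializes every match into a single Collection; with millions
        // of files this is what drives the heap toward OOM.
        Collection<File> matches = FileUtils.listFiles(
                new File("/mnt/nfs/data"),        // placeholder NFS mount
                new RegexFileFilter(".*\\.log"),  // placeholder regex
                TrueFileFilter.INSTANCE);         // recurse into every subdir
        System.out.println(matches.size());
    }
}
```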
Upvotes: 0
Views: 70
Reputation: 103273
Ditch Apache Commons. You don't need it: everything it does is either a fundamentally flawed approach, has been superseded by core library functionality, or is done better by more recent alternatives such as Guava.
To tackle the problem of walking through humongous directory trees recursively, make a file walker using the new (well, I say new, it's about a decade old at this point) java.nio file API: Files.walk, or Files.walkFileTree if you want per-directory control.
Neither accumulates results in memory; your code is simply invoked for every entry, and with walkFileTree you can choose for any given dir whether to 'enter' it (recurse into the subdir) or not. Even if you have billions of files, this won't run out of memory; it'll simply take a while to dig through, invoking you millions of times.
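A minimal sketch of that approach, using Files.walkFileTree to delete matching files older than a cutoff (the mount point, regex, and 30-day cutoff are placeholders for whatever your actual setup uses):

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.regex.Pattern;

public class OldFileSweeper {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("/mnt/nfs/data");            // placeholder NFS mount
        Pattern pattern = Pattern.compile(".*\\.log");     // placeholder regex
        Instant cutoff = Instant.now().minus(30, ChronoUnit.DAYS);

        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                // Entries are handed to us one at a time; nothing is collected,
                // so heap usage stays flat no matter how many files exist.
                if (pattern.matcher(file.getFileName().toString()).matches()
                        && attrs.lastModifiedTime().toInstant().isBefore(cutoff)) {
                    Files.delete(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Files on NFS can vanish mid-walk; skip rather than abort.
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```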
If you need to do lookups as part of the process, use the walker to pump all the data into a database, add appropriate indexes, and then run your queries there. Or re-engineer your disk setup so that all lookups can be done on name alone. I have to be a bit vague as the question doesn't specify the specific setup, but, for example, if you have to find a file with a given git hash, instead of storing all git patches in one ginormous dir, make subdirs that match the start of the hash: the top-level dir has 256 directories (from 00 to ff), and if need be each such dir itself has 256 subdirs, so that eventually e.g. f5/9a/f59a123456789abcdef.patch exists, which is the file you wanted. Now finding a file given the first 6 characters of its hash can be done quickly.
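For illustration, a small sketch of resolving such a sharded path from a hash (the root dir and .patch suffix are made up to match the example above):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ShardedLookup {
    // First two hex chars pick the top-level dir, next two the subdir,
    // so a lookup touches at most a few hundred entries per directory.
    static Path patchPath(Path root, String hash) {
        return root.resolve(hash.substring(0, 2))
                   .resolve(hash.substring(2, 4))
                   .resolve(hash + ".patch");
    }

    public static void main(String[] args) {
        Path root = Paths.get("/mnt/nfs/patches");  // placeholder root
        System.out.println(patchPath(root, "f59a123456789abcdef"));
        // -> /mnt/nfs/patches/f5/9a/f59a123456789abcdef.patch
    }
}
```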
Upvotes: 1