ensnare

Reputation: 42093

In Python, how can I improve glob performance for a very large directory of files?

I am using Python 3.9 to recurse through a directory of files and send the files to a multiprocessing queue. The directory has 10m+ files and it can take up to 20 minutes to build and process the initial file list. How could I improve this? Perhaps there is a way to recurse through the files without loading them into memory first?

import glob

path = "/directory-of-files"

def is_valid_file(f):
    # returns True if the file meets conditions
    return True

files = glob.glob(path + '/**', recursive=True)
files = filter(is_valid_file, files)  # filter valid files only
results = pool.map(setMedia, files)

Upvotes: 3

Views: 3100

Answers (1)

Aivean

Reputation: 10882

Perhaps there is a way to recurse through the files without loading them into memory first?

glob.iglob returns an iterator which yields the same values as glob() without actually storing them all simultaneously.
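A minimal sketch of what that could look like, reusing is_valid_file and setMedia from the question and assuming a standard multiprocessing.Pool. Note that pool.map would still pull the whole iterator into a list, so pool.imap_unordered (or imap) is what keeps the pipeline lazy end to end:

import glob
import multiprocessing

path = "/directory-of-files"

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        # iglob yields one path at a time instead of building a 10m-entry list
        files = glob.iglob(path + '/**', recursive=True)
        valid = filter(is_valid_file, files)
        # imap_unordered consumes the iterator lazily; chunksize cuts IPC overhead
        for result in pool.imap_unordered(setMedia, valid, chunksize=256):
            pass  # collect or handle each result here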

If that doesn't help, perhaps the bottleneck is not in pattern matching or list building, but simply in traversing all the files. You can write a simple recursive tree traversal using os.listdir and see how much time it takes to traverse your directory tree (without matching files).
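For example, a rough timing sketch along those lines (plain os.listdir, no pattern matching, path assumed to be the directory from the question):

import os
import time

def walk(path):
    # recursively yield every non-directory entry under path using plain os.listdir
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            yield from walk(full)
        else:
            yield full

start = time.perf_counter()
count = sum(1 for _ in walk("/directory-of-files"))
print(f"traversed {count} files in {time.perf_counter() - start:.1f}s")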

Upvotes: 1
