Reputation: 42093
I am using Python 3.9 to recurse through a directory of files and send the files to a multiprocessing queue. The directory has 10m+ files and it can take up to 20 minutes to build and process the initial file list. How could I improve this? Perhaps there is a way to recurse through the files without loading them into memory first?
import glob
import multiprocessing

path = "/directory-of-files"

def is_valid_file(file):
    # returns True if the file meets conditions
    return True

files = glob.glob(path + '/**', recursive=True)
files = filter(is_valid_file, files)  # filter valid files only

with multiprocessing.Pool() as pool:
    results = pool.map(setMedia, files)
Upvotes: 3
Views: 3100
Reputation: 10882
Perhaps there is a way to recurse through the files without loading them into memory first?
glob.iglob returns an iterator which yields the same values as glob() without actually storing them all simultaneously.
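For example, here is a minimal sketch of the question's pipeline rewritten around iglob (is_valid_file and setMedia stand in for the question's own versions). One caveat: pool.map converts its iterable to a list internally, so this sketch uses pool.imap_unordered to keep the streaming behaviour:

import glob
import multiprocessing

path = "/directory-of-files"

def is_valid_file(file):
    # stand-in for the question's validity check
    return True

def setMedia(file):
    # stand-in for the question's worker function
    return file

if __name__ == "__main__":
    # iglob yields paths lazily instead of building a 10M-entry list
    files = filter(is_valid_file, glob.iglob(path + '/**', recursive=True))
    with multiprocessing.Pool() as pool:
        # imap_unordered consumes the iterator lazily; pool.map would
        # first turn it into a list, losing the memory benefit
        for result in pool.imap_unordered(setMedia, files, chunksize=1000):
            pass  # handle each result as it arrives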
If that doesn't help, perhaps the bottleneck is not in pattern matching or list building but simply in traversing all the files. You can write a simple recursive tree traversal using os.listdir and see how much time it takes to walk your directory tree without matching any files, as in the sketch below.
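A rough timing sketch along those lines (hypothetical; it only counts files, with no pattern matching or processing):

import os
import time

def walk(path):
    # bare recursive traversal: yield every regular file under path
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            yield from walk(full)
        else:
            yield full

start = time.perf_counter()
count = sum(1 for _ in walk("/directory-of-files"))
print(f"traversed {count} files in {time.perf_counter() - start:.1f}s")

If this alone takes most of the 20 minutes, the cost is in the filesystem traversal itself, not in glob or the list building.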
Upvotes: 1