Syrius

Reputation: 25

How to optimize search with list dir and path walk?

Python 2.7.5 Win/Mac.

I'm trying to find the best way to search for files (over 10,000 of them) across multiple storage units (about 128 TiB total). The files have specific extensions, and I can skip some folders.

Here is my first function, using os.listdir and recursion:

import os

count = 0
def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for i in pathList:
        subPath = path + os.path.sep + i
        if os.path.isfile(subPath):
            fileName = os.path.basename(subPath)
            extension = fileName[fileName.rfind("."):]
            if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                count += 1
                # do stuff . . .
        elif os.path.isdir(subPath):
            if "UselessFolder1" not in subPath and "UselessFolder2" not in subPath:
                SearchFiles1(subPath)

It works, but I think it could be better (faster and cleaner). Or am I wrong?

So I tried os.walk:

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for i in dirpath:
            if "UselessFolder1" not in i and "UselessFolder2" not in i:
                for y in files:
                    fileName = os.path.basename(y)
                    extension = fileName[fileName.rfind("."):]
                    if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                        count += 1
                        # do stuff . . .
    return count

"count" is wrong and a way slower. And I think I don't really understand how path.walk works.

My question is: what can I do to optimize this search?

Upvotes: 0

Views: 334

Answers (2)

Syrius

Reputation: 25

So, after testing and discussing with tdelaney, I optimized both solutions as follows:

import os

count = 0
target_files = set((".ext1", ".ext2", ".ext3"))  # etc.
useless_dirs = set(("UselessFolder1", "UselessFolder2"))  # etc.
# it could be target_dirs; just swap `in` for `not in` in the comparison.

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for content in pathList:
        fullPath = os.path.join(path, content)
        if os.path.isfile(fullPath):
            if os.path.splitext(fullPath)[1] in target_files:
                count += 1
                # do stuff with 'fullPath' . . .
        elif os.path.isdir(fullPath):
            if content not in useless_dirs:
                SearchFiles1(fullPath)

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        for filename in [name for name in files if os.path.splitext(name)[1] in target_files]:
            count += 1
            fullPath = os.path.join(dirpath, filename)
            #do stuff with 'fullPath' . . .
    return count

It works fine on Mac/PC with Python 2.7.5.

Speed-wise, the two are dead even.
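For anyone who wants to check that on their own data, a crude way to compare the two functions is to time a single call of each (a sketch; the timeit module gives more rigorous numbers, and the storage path is whatever test root you point it at):

```python
import time

def time_call(func, *args):
    # wall-clock timing for a single call; returns (result, seconds)
    start = time.time()
    result = func(*args)
    return result, time.time() - start

# hypothetical usage:
# count, seconds = time_call(SearchFiles2, some_storage_root)
```

Run each function a few times and ignore the first run, since the OS file-system cache makes later runs faster.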

Upvotes: 0

tdelaney

Reputation: 77347

Your first solution is reasonable, except that you could use os.path.splitext. The second solution is incorrect because dirpath is a single string, so `for i in dirpath` iterates over its characters and you revisit the files list once per character instead of processing it once. With os.walk, the trick is that directories removed from subdirs are not part of the next round of enumerations.
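For reference, os.walk yields one (dirpath, subdirs, files) triple per directory, where dirpath is a plain string and the other two are lists of names. A quick sketch against a throwaway directory tree makes the types obvious:

```python
import os
import tempfile

# build a tiny throwaway tree: root/a.txt and root/sub/
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "a.txt"), "w").close()

for dirpath, subdirs, files in os.walk(root):
    # dirpath is a string; subdirs and files are lists of names
    print(repr(dirpath), subdirs, files)
```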

def SearchFiles2(path):
    useless_dirs = set(("UselessFolder1", "UselessFolder2"))
    useless_files = set((".ext1", ".ext2"))
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # remove unwanted subdirs from future enumeration
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        # list of interesting files
        myfiles = [os.path.join(dirpath, name) for name in files
            if os.path.splitext(name)[1] not in useless_files]
        count += len(myfiles)
        for filepath in myfiles:
            # example shows file stats
            print(filepath, os.stat(filepath))
    return count

Enumerating the file system of a single storage unit can only go so fast. The best way to speed this up is to run the enumeration of different storage units in different threads.
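A minimal sketch of that idea: one thread per storage root, each running its own walk. The names here (`search_all`, `count_files`, `target_exts`) are mine, not from the code above, and the extension set is assumed:

```python
import os
import threading

target_exts = set((".ext1", ".ext2", ".ext3"))  # assumed target extensions

def count_files(root, results):
    # enumerate one storage unit and record its count under its root
    count = 0
    for dirpath, subdirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1] in target_exts:
                count += 1
    results[root] = count

def search_all(roots):
    # one thread per storage root; each thread writes to a distinct dict key
    results = {}
    threads = [threading.Thread(target=count_files, args=(root, results))
               for root in roots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.values())
```

This helps because each storage unit can answer directory-listing requests independently, so the walks overlap their I/O waits even under the GIL.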

Upvotes: 1
