Syrius

Reputation: 25

How to optimize search with list dir and path walk?

Python 2.7.5 Win/Mac.

I'm trying to find the best way to search for files (over 10,000 of them) across multiple storage units (about 128 TiB total). The files have specific extensions, and I can skip some folders.

Here is my first function, using os.listdir and recursion:

import os

count = 0
def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for i in pathList:
        subPath = path + os.path.sep + i
        if os.path.isfile(subPath):
            fileName = os.path.basename(subPath)
            extension = fileName[fileName.rfind("."):]
            if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                count += 1
                # do stuff . . .
        elif os.path.isdir(subPath):
            if "UselessFolder1" not in subPath and "UselessFolder2" not in subPath:
                SearchFiles1(subPath)

It works, but I think it could be better (faster and cleaner). Or am I wrong?

So I tried os.walk:

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for i in dirpath:
            if "UselessFolder1" not in i and "UselessFolder2" not in i:
                for y in files:
                    fileName = os.path.basename(y)
                    extension = fileName[fileName.rfind("."):]
                    if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                        count += 1
                        # do stuff . . .
    return count

"count" is wrong and a way slower. And I think I don't really understand how path.walk works.

My question is: what can I do to optimize this search?

Upvotes: 0

Views: 334

Answers (2)

Syrius

Reputation: 25

So, after testing and discussing with tdelaney, I optimized both solutions as follows:

import os

count = 0
target_files = set((".ext1", ".ext2", ".ext3"))  # etc.
useless_dirs = set(("UselessFolder1", "UselessFolder2"))  # etc.
# it could be target_dirs; just swap `in` for `not in` in the comparison.

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for content in pathList:
        fullPath = os.path.join(path, content)
        if os.path.isfile(fullPath):
            if os.path.splitext(fullPath)[1] in target_files:
                count += 1
                # do stuff with 'fullPath' . . .
        elif os.path.isdir(fullPath):
            if content not in useless_dirs:
                SearchFiles1(fullPath)

def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        for filename in [name for name in files if os.path.splitext(name)[1] in target_files]:
            count += 1
            fullPath = os.path.join(dirpath, filename)
            #do stuff with 'fullPath' . . .
    return count

It works fine on Mac/PC with Python 2.7.5.

Speed-wise, the two are dead even.
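For anyone who wants to check that on their own data, a crude way to compare the two functions is to time a single call of each (a sketch; the timeit module gives more rigorous numbers, and the storage path is whatever test root you point it at):

```python
import time

def time_call(func, *args):
    # wall-clock timing for a single call; returns (result, seconds)
    start = time.time()
    result = func(*args)
    return result, time.time() - start

# hypothetical usage:
# count, seconds = time_call(SearchFiles2, some_storage_root)
```

Run each function a few times and ignore the first run, since the OS file-system cache makes later runs faster.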

Upvotes: 0

tdelaney

Reputation: 77347

Your first solution is reasonable, except that you could use os.path.splitext. The second solution is incorrect because dirpath is a single string, so `for i in dirpath` iterates over its characters and you revisit the files list once per character instead of processing it once. With os.walk, the trick is that directories removed from subdirs are not part of the next round of enumerations.
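For reference, os.walk yields one (dirpath, subdirs, files) triple per directory, where dirpath is a plain string and the other two are lists of names. A quick sketch against a throwaway directory tree makes the types obvious:

```python
import os
import tempfile

# build a tiny throwaway tree: root/a.txt and root/sub/
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "a.txt"), "w").close()

for dirpath, subdirs, files in os.walk(root):
    # dirpath is a string; subdirs and files are lists of names
    print(repr(dirpath), subdirs, files)
```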

def SearchFiles2(path):
    useless_dirs = set(("UselessFolder1", "UselessFolder2"))
    useless_files = set((".ext1", ".ext2"))
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # remove unwanted subdirs from future enumeration
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        # list of interesting files
        myfiles = [os.path.join(dirpath, name) for name in files
            if os.path.splitext(name)[1] not in useless_files]
        count += len(myfiles)
        for filepath in myfiles:
            # example shows file stats
            print(filepath, os.stat(filepath))
    return count

Enumerating the file system of a single storage unit can only go so fast. The best way to speed this up is to run the enumeration of different storage units in different threads.
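A minimal sketch of that idea: one thread per storage root, each running its own walk. The names here (`search_all`, `count_files`, `target_exts`) are mine, not from the code above, and the extension set is assumed:

```python
import os
import threading

target_exts = set((".ext1", ".ext2", ".ext3"))  # assumed target extensions

def count_files(root, results):
    # enumerate one storage unit and record its count under its root
    count = 0
    for dirpath, subdirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1] in target_exts:
                count += 1
    results[root] = count

def search_all(roots):
    # one thread per storage root; each thread writes to a distinct dict key
    results = {}
    threads = [threading.Thread(target=count_files, args=(root, results))
               for root in roots]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.values())
```

This helps because each storage unit can answer directory-listing requests independently, so the walks overlap their I/O waits even under the GIL.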

Upvotes: 1
