Just Works

Reputation: 35

Search a very large directory for a file containing text in its name

I have a network share containing around 300,000 files, and it's constantly changing (files added and removed). I want to search the directory for specific text to find certain files. I have trimmed my method down about as far as I can, but it still takes over 6 minutes to complete; I could probably do it manually in about the same time, depending on the number of strings I'm searching for. I want to multithread or multiprocess it, but I'm uncertain how that can be done when everything happens in a single call:

for filename in os.scandir(sourcedir):

Can anyone please help me figure this out?

import os

def scan(sourcedir: str, oset: set[str] | str) -> set[str]:
    found = set()
    for filename in os.scandir(sourcedir):
        for ordr in oset:
            if ordr in filename.name:
                print(filename.name)
                found.add(filename.name)
                break
    return found

RESULTS FROM A TYPICAL CALL: 516 function calls in 395.033 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        6    0.000    0.000    0.003    0.000 :39(isdir)
        6    0.000    0.000    1.346    0.224 :94(samefile)
       12    0.000    0.000    0.001    0.000 :103(join)
       30    0.000    0.000    0.000    0.000 :150(splitdrive)
        6    0.000    0.000    0.000    0.000 :206(split)
        6    0.000    0.000    0.000    0.000 :240(basename)
        6    0.000    0.000    0.000    0.000 :35(_get_bothseps)
        1    0.000    0.000    0.000    0.000 :545(normpath)
        1    0.000    0.000    0.000    0.000 :577(abspath)
        1    0.000    0.000  395.033  395.033 :1()
        1    0.000    0.000  395.033  395.033 CopyOrders.py:31(main)
        1  389.826  389.826  389.976  389.976 CopyOrders.py:67(scan)
        1    0.000    0.000    5.056    5.056 CopyOrders.py:88(copy)
        1    0.000    0.000    0.000    0.000 getopt.py:56(getopt)
        6    0.000    0.000    0.001    0.000 shutil.py:170(_copyfileobj_readinto)
        6    0.000    0.000    1.346    0.224 shutil.py:202(_samefile)
       18    0.000    0.000    1.493    0.083 shutil.py:220(_stat)
        6    0.001    0.000    4.295    0.716 shutil.py:226(copyfile)
        6    0.000    0.000    0.756    0.126 shutil.py:290(copymode)
        6    0.000    0.000    5.054    0.842 shutil.py:405(copy)
        6    0.000    0.000    0.000    0.000 {built-in method _stat.S_IMODE}
        6    0.000    0.000    0.000    0.000 {built-in method _stat.S_ISDIR}
        6    0.000    0.000    0.000    0.000 {built-in method _stat.S_ISFIFO}
        1    0.000    0.000  395.033  395.033 {built-in method builtins.exec}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
       73    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
       38    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.min}
       14    0.003    0.000    0.003    0.000 {built-in method builtins.print}
       12    2.180    0.182    2.180    0.182 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 {built-in method nt._getfullpathname}
        1    0.000    0.000    0.000    0.000 {built-in method nt._path_normpath}
        6    0.012    0.002    0.012    0.002 {built-in method nt.chmod}
       49    0.000    0.000    0.000    0.000 {built-in method nt.fspath}
        1    0.149    0.149    0.149    0.149 {built-in method nt.scandir}
       36    2.841    0.079    2.841    0.079 {built-in method nt.stat}
       12    0.000    0.000    0.000    0.000 {built-in method sys.audit}
       12    0.019    0.002    0.019    0.002 {method '__exit__' of '_io._IOBase' objects}
        6    0.000    0.000    0.000    0.000 {method '__exit__' of 'memoryview' objects}
        6    0.000    0.000    0.000    0.000 {method 'add' of 'set' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       36    0.000    0.000    0.000    0.000 {method 'find' of 'str' objects}
       12    0.001    0.000    0.001    0.000 {method 'readinto' of '_io.BufferedReader' objects}
       30    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
        6    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        6    0.000    0.000    0.000    0.000 {method 'write' of '_io.BufferedWriter' objects}

Upvotes: 3

Views: 125

Answers (3)

Just Works

Reputation: 35

I ended up finding that no matter how many strings I scan for, the run doesn't take much longer than it does for a short list. So I think the bulk of the time goes into gathering the list of existing files to compare against, which is effectively indexing the directory. I use the tool for larger sets of files; for one or two, I search manually. I suppose it is what it is.
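
A quick way to confirm this is to time the enumeration and the matching phases separately (a sketch; the share path and search strings below are placeholders):

import os
import time

sourcedir = r'\\server\share\orders'   # placeholder network path
oset = {'12345', '67890'}              # placeholder search strings

t0 = time.perf_counter()
names = os.listdir(sourcedir)                            # enumeration phase
t1 = time.perf_counter()
found = {n for n in names if any(o in n for o in oset)}  # matching phase
t2 = time.perf_counter()

print(f'enumerate: {t1 - t0:.1f}s  match: {t2 - t1:.1f}s  hits: {len(found)}')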

Upvotes: 0

blhsing

Reputation: 107050

Since you're only interested in the file names and not any of the other file attributes, you shouldn't use os.scandir, which incurs the overhead of building a DirEntry object for every entry. Use os.listdir instead to retrieve just a list of file names.

Secondly, you can search for multiple substrings more efficiently with a regex alternation pattern, since the re module is implemented in C and is much faster than a Python-level loop.

import os
import re

def scan(sourcedir: str, oset: set[str]) -> set[str]:
    # one alternation pattern that matches any of the search strings
    regex = re.compile('|'.join(map(re.escape, oset)))
    return set(filter(regex.search, os.listdir(sourcedir)))

Note that you have the oset parameter typed as set[str] | str, which makes little sense since a container of strings and a single string can't be handled in a consistent manner. I've typed it as set[str] in my example instead.
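
For example, a hypothetical call (the share path and order numbers are made up for illustration):

orders = {'1044', '2671', '3300'}                  # hypothetical order numbers
matches = scan(r'\\server\share\orders', orders)   # hypothetical share path
print(matches)                                     # set of matching file names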

Upvotes: 3

Kaleb Fenley

Reputation: 226

You could try glob.

I don't have a directory with 300,000 files to test it on, but I assume it would be pretty quick (a few seconds).

import glob
import os

sourcedir = r'path\to\your\files'
oset = ['some', 'list', 'not', 'shown', 'in', 'your', 'code']

found = []
for ordr in oset:
    # Get a list of all files in the "sourcedir" directory with "ordr" in the filename
    files = glob.glob(os.path.join(sourcedir, f'*{ordr}*'))
    found.extend(files)

print('\n'.join(found))
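
The same idea with pathlib, if you prefer Path objects (a sketch reusing the placeholder names above):

from pathlib import Path

found = []
for ordr in oset:
    # Path.glob builds the same *ordr* pattern without manual separator handling
    found.extend(str(p) for p in Path(sourcedir).glob(f'*{ordr}*'))

print('\n'.join(found))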

Upvotes: 0
