mk8efz

Reputation: 1424

Targeting a directory with os.walk

Due to a large and convoluted directory structure, my script is searching too many directories:

root--
     |
     --Project A--
                  |
                  -- Irrelevant
                  -- Irrelevant
                  -- TARGET
     |
     --Project B--
                  |
                  -- Irrelevant
                  -- TARGET
                  -- Irrelevant
     |
     -- Irrelevant  --
                       |
                       --- Irrelevant

The TARGET directory is the only one I need to traverse and it has a consistent name in each project (we'll just call it Target here).

I looked at this question:

Excluding directories in os.walk

but instead of excluding, I need to include the "target" directory which isn't at the "root" level, but one level down.

I've tried something like:

def walker(path):
    for dirpath, dirnames, filenames in os.walk(path):
        dirnames[:] = set(['TARGET'])

But this also affects the root directory, so it prunes all the directories it needs to traverse (Project A, Project B, ...).

Upvotes: 4

Views: 2365

Answers (2)

ShadowRanger

Reputation: 155418

For a whitelisting scenario like this, I'd suggest using glob.iglob to get the directories by pattern. It's a generator, so you'll get each result as soon as it's found. (Note: at time of writing, it's still implemented with os.listdir under the hood, not os.scandir, so it's only half a generator; each directory is scanned eagerly, but the next directory is scanned only once all values from the current one have been yielded.) For example, in this case:

try:
    from future_builtins import filter  # Py2 only: generator-based filter
except ImportError:
    pass                                # Py3: built-in filter is already lazy

import glob
import os.path

from operator import methodcaller

try:
    from os import scandir       # Built-in on 3.5 and above
except ImportError:
    from scandir import scandir  # PyPI package on 3.4 and below

# path is the root directory from the question.
# If on 3.4+, use glob.escape for safety; before then, if path might contain glob
# special characters and you don't want them processed you need to escape manually
globpat = os.path.join(glob.escape(path), '*', 'TARGET')

# Find paths matching the pattern, filtering out non-directories as we go:
for targetdir in filter(os.path.isdir, glob.iglob(globpat)):
    # targetdir is the qualified name of a single directory matching the pattern,
    # so if you want to process the files in that directory, you can follow up with:
    for fileentry in filter(methodcaller('is_file'), scandir(targetdir)):
        # fileentry is a DirEntry with attributes for .name, .path, etc.
        print(fileentry.path)  # placeholder: replace with your own handling

See the docs on os.scandir for more advanced usage, or you can just make the inner loop a call to os.walk to preserve most of your original code as is.
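As a rough sketch of that second option, assuming the layout from the question (root/<project>/TARGET): glob locates the TARGET directories and os.walk handles everything below them. The name iter_target_files and its parameters are hypothetical, and glob.escape needs Python 3.4 or later.

```python
import glob
import os


def iter_target_files(root, target_name='TARGET'):
    """Yield paths of all files under root/*/target_name (hypothetical helper)."""
    # Escape both parts so glob special characters in them are matched literally.
    pattern = os.path.join(glob.escape(root), '*', glob.escape(target_name))
    for targetdir in glob.iglob(pattern):
        if not os.path.isdir(targetdir):
            continue  # a plain file that happens to be named like the target
        # Inside a target directory nothing is pruned: walk it in full.
        for dirpath, dirnames, filenames in os.walk(targetdir):
            for name in filenames:
                yield os.path.join(dirpath, name)
```

Because the pattern only ever matches one level below root, the Irrelevant directories are never listed at all. Note that glob's * does not match names starting with a dot, so hidden project directories would be skipped.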

If you really must use os.walk, you can just be more targeted in how you prune dirs. Since you specified all TARGET directories should be only one level down, this is actually pretty easy. os.walk walks top down by default, which means the first set of results will be the root directory (which you don't want to prune solely to TARGET entries). So you can do:

import fnmatch
import os

for i, (dirpath, dirs, files) in enumerate(os.walk(path)):
    if i == 0:
        # Top level dir, prune non-Project dirs
        dirs[:] = fnmatch.filter(dirs, 'Project *')
    elif os.path.samefile(os.path.dirname(dirpath), path):
        # Second level dir, prune non-TARGET dirs
        dirs[:] = fnmatch.filter(dirs, 'TARGET')
    else:
        # Do whatever handling you'd normally do for files and directories
        # located under path/Project */TARGET/
        pass

Upvotes: 2

Bakuriu

Reputation: 101959

The issue with your code is that you always modify the dirnames list, so even at the root level all the subdirectories are removed and os.walk never ends up visiting the various Project X directories.

What you want is to purge other directories only when the TARGET one is present:

if 'TARGET' in dirnames:
    dirnames[:] = ['TARGET']

This will allow the os.walk call to visit the Project X directories, but will prevent it from going inside the Irrelevant directories that sit alongside a TARGET one.
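Put together as a complete walker, that could look roughly like the sketch below. The function name walk_targets and the extra filter on the relative path are my own assumptions, added so that only directories inside a TARGET are reported; they are not part of the snippet above.

```python
import os


def walk_targets(root, target_name='TARGET'):
    """Yield (dirpath, filenames) for directories at or below a target_name directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        if target_name in dirnames:
            # Prune in place so os.walk only descends into the target directory.
            dirnames[:] = [target_name]
        # Report only directories that lie inside a target directory.
        relative_parts = os.path.relpath(dirpath, root).split(os.sep)
        if target_name in relative_parts:
            yield dirpath, filenames
```

Directories that contain no TARGET at all (such as the top-level Irrelevant one in the question) are still traversed here, just not reported, so this saves less work than pruning by depth. The pruning check also applies inside a TARGET directory, so a nested subdirectory named TARGET would hide its siblings.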

Upvotes: 4
