Reputation: 1424
Due to a large and convoluted directory structure, my script is searching too many directories:
    root
    |-- Project A
    |   |-- Irrelevant
    |   |-- Irrelevant
    |   `-- TARGET
    |-- Project B
    |   |-- Irrelevant
    |   |-- TARGET
    |   `-- Irrelevant
    `-- Irrelevant
        `-- Irrelevant
The TARGET directory is the only one I need to traverse, and it has a consistent name in each project (we'll just call it TARGET here).
I looked at this question:
Excluding directories in os.walk
but instead of excluding, I need to include the "target" directory which isn't at the "root" level, but one level down.
I've tried something like:
    import os

    def walker(path):
        for dirpath, dirnames, filenames in os.walk(path):
            # Prune in place so os.walk only descends into TARGET
            dirnames[:] = set(['TARGET'])
But this also affects the root directory (thereby pruning all the directories it needs to traverse: Project A, Project B, ...).
Upvotes: 4
Views: 2365
Reputation: 155418
For a whitelisting scenario like this, I'd suggest using glob.iglob to get the directories by a pattern. It's a generator, so you'll get each result as fast as it finds them (Note: At time of writing, it's still implemented with os.listdir under the hood, not os.scandir, so it's only half a generator; each directory is scanned eagerly, but it only scans the next directory once it's finished yielding values from the current directory). For example, in this case:
    try:
        from future_builtins import filter  # Py2 only: generator-based filter
    except ImportError:
        pass  # Py3: the built-in filter is already lazy
    import os.path
    import glob
    from operator import methodcaller
    try:
        from os import scandir  # Built-in on 3.5 and above
    except ImportError:
        from scandir import scandir  # PyPI package on 3.4 and below

    # path is the root directory to search, as in the question.
    # If on 3.4+, use glob.escape for safety; before then, if path might contain glob
    # special characters and you don't want them processed you need to escape manually
    globpat = os.path.join(glob.escape(path), '*', 'TARGET')
    # Find paths matching the pattern, filtering out non-directories as we go:
    for targetdir in filter(os.path.isdir, glob.iglob(globpat)):
        # targetdir is the qualified name of a single directory matching the pattern,
        # so if you want to process the files in that directory, you can follow up with:
        for fileentry in filter(methodcaller('is_file'), scandir(targetdir)):
            # fileentry is a DirEntry with attributes for .name, .path, etc.
            print(fileentry.path)  # placeholder: replace with your own processing
See the docs on os.scandir for more advanced usage, or you can just make the inner loop a call to os.walk to preserve most of your original code as is.
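As a rough illustration of that last suggestion, here is a minimal Python 3 sketch that uses glob.iglob to find the TARGET directories and then hands each one to os.walk. The function name walk_targets and the target_name parameter are made up for this example, and it assumes every TARGET directory sits exactly one level below the root, as in the question.

```python
import glob
import os


def walk_targets(root, target_name="TARGET"):
    """Yield os.walk() tuples for every root/*/target_name directory tree.

    walk_targets and target_name are hypothetical names for this sketch.
    """
    # Escape the root so glob metacharacters in it are matched literally.
    pattern = os.path.join(glob.escape(root), "*", target_name)
    for target_dir in filter(os.path.isdir, glob.iglob(pattern)):
        # os.walk takes over below each matched directory, so existing
        # per-directory handling code can be reused unchanged.
        yield from os.walk(target_dir)
```

Calling `for dirpath, dirnames, filenames in walk_targets(path): ...` then looks just like the original os.walk loop, except that nothing outside the TARGET trees is ever listed.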
If you really must use os.walk, you can just be more targeted in how you prune dirs. Since you specified all TARGET directories should be only one level down, this is actually pretty easy. os.walk walks top down by default, which means the first set of results will be the root directory (which you don't want to prune solely to TARGET entries). So you can do:
    import fnmatch
    import os

    for i, (dirpath, dirs, files) in enumerate(os.walk(path)):
        if i == 0:
            # Top level dir, prune non-Project dirs
            dirs[:] = fnmatch.filter(dirs, 'Project *')
        elif os.path.samefile(os.path.dirname(dirpath), path):
            # Second level dir, prune non-TARGET dirs
            dirs[:] = fnmatch.filter(dirs, 'TARGET')
        else:
            # Do whatever handling you'd normally do for files and directories
            # located under path/Project */TARGET/
            pass
Upvotes: 2
Reputation: 101959
The issue with your code is that you are always modifying the dirnames list, but this means that even at the root level all the subdirectories are removed and hence the recursive calls do not end up visiting the various Project X directories.
What you want is to purge other directories only when the TARGET one is present:
    if 'TARGET' in dirnames:
        dirnames[:] = ['TARGET']
This will allow the os.walk call to visit the Project X directories, but will prevent it from going inside the Irrelevant ones.
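To show how that check might fit into the walker function from the question, here is a minimal sketch. The target_name parameter and the choice to yield file paths are assumptions made for illustration. Note that directories with no TARGET child (such as the top-level Irrelevant one) are still traversed, so the sketch also filters on whether the current directory lies inside a TARGET tree.

```python
import os


def walker(path, target_name="TARGET"):
    """Yield paths of files located under directories named target_name.

    target_name and the yielded file paths are assumptions of this sketch.
    """
    for dirpath, dirnames, filenames in os.walk(path):
        if target_name in dirnames:
            # Prune siblings in place so os.walk only descends into TARGET.
            dirnames[:] = [target_name]
        # Only report files once we are somewhere inside a TARGET tree.
        rel_parts = os.path.relpath(dirpath, path).split(os.sep)
        if target_name in rel_parts:
            for name in filenames:
                yield os.path.join(dirpath, name)
```

Files in the root, in the Project X directories themselves, and in branches that never contain a TARGET directory are skipped by the second check rather than by pruning.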
Upvotes: 4