pauljdurack

Reputation: 39

Create list using regex inputs

I'm searching a very convoluted directory tree using os.walk in Python 2.7.7 and want to limit the search by pruning the list of subdirectories in place:

import os,re
dirExclude = set(['amip4K','amip4xCO2','aqua4K','aqua4xCO2'])
for (path,dirs,files) in os.walk(inpath,topdown=True):
     dirs[:] = [d for d in dirs if d not in dirExclude]
     # Do something

I want to add to this dirExclude set anything that matches the regular expression r'decadal[0-9]{4}', but I am having a hard time working out how best to use a regular expression in my list/set definition.

Any suggestions here? Or indeed a more efficient way to use the os.walk function?

After a number of suggestions the above can be improved to:

import os,re
dirExclude = set(['amip4K','amip4xCO2','aqua4K','aqua4xCO2'])
decExclude = re.compile(r'decadal[0-9]{4}')
for (path,dirs,files) in os.walk(inpath,topdown=True):
     dirs[:] = [d for d in dirs if d not in dirExclude and not re.search(decExclude,d)]
     # Do something

After investigating the dirs[:] = versus dirs = assignment: the [:] is needed to ensure that os.walk uses the pruned directory listing rather than the full (pre-pruned) directory listing.
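
A minimal sketch of that difference, assuming a throwaway tree built with tempfile. The directory names, the keep() predicate and the visited() helper are illustrative only and not part of the original code:

```python
import os
import re
import shutil
import tempfile

dir_exclude = {'amip4K', 'aqua4K'}
dec_exclude = re.compile(r'decadal[0-9]{4}')

def keep(d):
    """True if directory name d should still be searched."""
    return d not in dir_exclude and not dec_exclude.search(d)

def visited(root, prune_in_place):
    """Return the sorted relative paths that os.walk descends into."""
    seen = []
    for path, dirs, files in os.walk(root, topdown=True):
        seen.append(os.path.relpath(path, root))
        kept = [d for d in dirs if keep(d)]
        if prune_in_place:
            dirs[:] = kept   # mutates the list object os.walk holds
        else:
            dirs = kept      # only rebinds the local name; os.walk is unaffected
    return sorted(seen)

# Hypothetical tree: three top-level directories, each with one subdirectory.
root = tempfile.mkdtemp()
try:
    for name in ('amip4K', 'decadal1960', 'historical'):
        os.makedirs(os.path.join(root, name, 'sub'))
    pruned = visited(root, prune_in_place=True)
    unpruned = visited(root, prune_in_place=False)
finally:
    shutil.rmtree(root)
```

With slice assignment only the root and the historical branch are visited; with plain rebinding all seven directories are still walked.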

Upvotes: 2

Views: 679

Answers (3)

Joel Cornett

Reputation: 24788

Augmenting the previous suggestions, you can use ifilterfalse (or filterfalse in Python 3.x) to efficiently filter on a regular expression:

from itertools import ifilterfalse
import re
import os

exclude = {'foo', 'bar', 'baz'}
expr = re.compile(r'decadal\d{4}')
for (path, dirs, files) in os.walk(inpath):
    dirs[:] = set(ifilterfalse(expr.match, dirs)) - exclude

Some further notes:

  • Simply doing dirs = [alist] is insufficient, because this only changes what the local name dirs refers to (i.e. it no longer refers to the dirs list that os.walk uses). You must modify the actual list object that os.walk references. You can do this (as above) with slice assignment, which is more or less equivalent to the expression: dirs.__setitem__(slice(None, None), [alist])
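
A small sketch of the same point without os.walk, where walk_list is a hypothetical stand-in for the list that os.walk holds internally:

```python
# walk_list stands in for the list object that os.walk keeps a reference to.
walk_list = ['amip4K', 'decadal1960', 'historical']
dirs = walk_list  # the name the loop body sees

# Rebinding: dirs now points at a brand-new list, walk_list is untouched.
dirs = [d for d in dirs if d != 'amip4K']
after_rebinding = list(walk_list)

# Slice assignment: the contents of the original object are replaced.
dirs = walk_list
dirs[:] = [d for d in dirs if d != 'amip4K']
same_object = dirs is walk_list
after_slice = list(walk_list)

# The roughly equivalent explicit call mentioned above.
walk_list.__setitem__(slice(None, None), ['historical'])
```

Only the last two forms change what a caller holding walk_list would see.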

Upvotes: 1

timgeb

Reputation: 78690

Instead of adding to dirExclude, why not just check whether there's a match for r'decadal[0-9]{4}' in a dirname d?

I'm thinking of something like this:

import os, re
dirExclude = set(['amip4K','amip4xCO2','aqua4K','aqua4xCO2'])
exre = re.compile(r'decadal[0-9]{4}')
for (path,dirs,files) in os.walk(inpath,topdown=True):
     # slice assignment so that os.walk sees the pruned list
     dirs[:] = [d for d in dirs if d not in dirExclude and not exre.search(d)]
     # Do something

Explanation:

exre.search(d) will return None if there is no match for your regex inside d. not None will then evaluate to True. Otherwise, exre.search(d) will return a MatchObject and not exre.search(d) will evaluate to False.
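
As a rough illustration, here is how search compares with match (which other answers here use) on a few made-up directory names. The names list is an assumption for the example, not taken from the real tree:

```python
import re

exre = re.compile(r'decadal[0-9]{4}')

# Hypothetical directory names, for illustration only.
names = ['decadal1960', 'run_decadal1975', 'decadal', 'historical']

# search() looks for the pattern anywhere in the name.
kept_search = [d for d in names if not exre.search(d)]

# match() only succeeds if the pattern matches at the start of the name.
kept_match = [d for d in names if not exre.match(d)]
```

Here 'run_decadal1975' is dropped by the search version but kept by the match version, so the choice depends on whether the pattern may appear mid-name.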

Compiling the regular expression is optional. Without compiling, you would issue

exre = r'decadal[0-9]{4}'

and

dirs[:] = [d for d in dirs if d not in dirExclude and not re.search(exre, d)]

Compiling can be useful when you need to apply a regex many times, because the compilation step is then done only once. However, most of the time you won't notice a difference: even if you don't compile the regex manually, Python caches the most recently used regexes. To be precise, the last one hundred regexes, though the only reference I have for this is the Regular Expression Cookbook by Jan Goyvaerts and Steven Levithan.

Upvotes: 1

matsjoyce

Reputation: 5844

If you simply want to avoid all directories that match the re, you could do:

d_re = re.compile(r'decadal[0-9]{4}')
dirs[:] = [d for d in dirs if d_re.match(d) is None]

You could retrieve all the ignored files at the end by:

 dirExclude = dirExclude.union(d for d in dirs if d not in dirExclude)

or

dirExclude.update(d for d in dirs if d not in dirExclude)

Upvotes: 0
