Timmah

Reputation: 2121

Python generator: exception re-raised when sorting to a list

I have a simple poller class (code snippet below) which retrieves files from a number of folders based on a regex. I attempt to catch OSError exceptions and ignore them, since files could be moved or deleted, or their permissions could change, etc. During some testing (in which I created/deleted a large number of files) I noticed that when sorting the generator, the exceptions raised in the generator function (_get) were re-raised(?), and I had to use an additional try/except block to get around this.

Any idea why this is happening? All comments/improvements appreciated!

Thanks Timmah

def __init__(self, **kwargs):
    self._sortkey = kwargs.get('sortkey', os.path.getmtime)

def _get(self, maxitems=0):
    def customfilter(f):
        if self._exclude is not None and self._exclude.search(f): return False
        if self._regex is not None:
            return self._regex.search(f)

        return True

    count = 0
    for p in self.paths:
        if not os.path.isdir(p): raise PollException("'%s' is not a valid path." % (p), p)
        if maxitems and count >= maxitems: break
        try:
            for f in [os.path.join(p, f) for f in filter(customfilter, os.listdir(p))]:
                if maxitems and count >= maxitems: break

                if not self._validate(f): continue

                count += 1
                yield f
        except OSError:
            '''
            There will be instances where we wont have permission on the file/directory or
            when a file is moved/deleted before it was yielded.
            '''
            continue

def get(self, maxitems=0):
    try:
        if self._sortkey is not None:
            files = sorted(self._get(maxitems), key=self._sortkey, reverse=self._sortreverse)
        else:
            files = self._get(maxitems)
    except OSError:
        '''
        self._sortkey uses os.path function to sort so exceptions can happen again
        '''
        return

    for f in files:
        yield f

if __name__ == '__main__':
    while True:
        for f in poll(paths=['/tmp'], regex=r"^.*\.CSV").get(10):
            print f
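
To illustrate, here's a stripped-down, self-contained sketch of the behaviour I'm describing (the filenames and the flaky_key function are made up; flaky_key stands in for os.path.getmtime raising on a file that disappeared):

def gen():
    for name in ['a.csv', 'b.csv', 'c.csv']:
        try:
            yield name
        except OSError:
            # mirrors the except in _get; never reached, because the key
            # function runs inside sorted(), not inside the generator
            continue

def flaky_key(name):
    # stand-in for os.path.getmtime on a file that was deleted/unreadable
    if name == 'b.csv':
        raise OSError("simulated: file vanished")
    return name

try:
    print(sorted(gen(), key=flaky_key))
except OSError as e:
    # the OSError surfaces here, outside the generator
    print("caught outside the generator: %s" % e)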

EDIT: Thanks to @ShadowRanger for pointing out that the culprit was the os.path function passed as the sortkey param.

Upvotes: 0

Views: 161

Answers (1)

ShadowRanger

Reputation: 155363

Posting an answer for posterity: Per psychic intuition (and confirmation in the comments), self._sortkey was trying to stat the files being sorted. While having read permission on a directory is sufficient to get the filenames contained within it, if you lack read permission on those files, you won't be able to stat them.

Since sorted executes the key function outside the generator's scope, the exception isn't raised inside the generator, so the generator's try/except can't catch it. You'd need to pre-compute the stat-based key for each file (dropping files that can't be stat-ed), sort on that, then discard the (no longer relevant) key data. For example:

from operator import itemgetter

def with_key(filenames, key):
    '''Generates computed_key, filename pairs

    Silently filters out files where the key function raises OSError
    '''
    for f in filenames:
        try:
            yield key(f), f
        except OSError:
            pass

# ... skipping to the `sorted` call in get ...
# Replace the existing sorted call with:
# map(itemgetter(1), ...) strips the key, yielding only the file names
files = map(itemgetter(1),
            sorted(
                   # Use with_key to filter and decorate filenames with sortkey
                   with_key(self._get(maxitems), self._sortkey),
                   # Use key=itemgetter(0) so only sortkey is considered for
                   # sorting (making sort stable, instead of performing fallback
                   # comparison between filenames when key is the same)
                   key=itemgetter(0), reverse=self._sortreverse))

It's basically performing the Schwartzian Transform (aka "decorate-sort-undecorate") manually. Normally, Python's key argument for sorted/list.sort hides this complexity from you, but in this case (thanks to the possibility of exceptions, the need to drop an item when one occurs, and the desire to minimize race conditions by using EAFP patterns), you have to do the work yourself.
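
For reference, a tiny standalone sketch of the plain decorate-sort-undecorate pattern (with simple strings standing in for file paths), showing what the key argument normally does for you:

names = ['banana', 'fig', 'apple']

# The key argument hides the decoration:
by_length = sorted(names, key=len)

# The manual equivalent: decorate, sort on the decoration, undecorate
decorated = [(len(name), name) for name in names]   # decorate with the key
decorated.sort(key=lambda pair: pair[0])             # sort on the key only
undecorated = [name for _, name in decorated]        # undecorate

assert by_length == undecorated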

Alternate solution with Python 3.5 (or 2.6-2.7 and 3.2-3.4 using the third party scandir package):

You could avoid this issue entirely, with far less complexity and likely better performance (and, on Windows, even include unreadable files in your output, so long as the directory itself is readable and the file system caches file metadata in the directory entry). On Windows, os.scandir (or, pre-3.5, scandir.scandir) gets you the stat information cached in the directory entry "for free": you only pay the round-trip cost once per few thousand entries in a directory, not once per file. On Linux, the first call to DirEntry.stat caches the stat data, so making that call in _get means you can catch and handle OSError there while populating the cache; during sorting, self._sortkey can then use the cached data with no risk of OSError. So you could do:

try:
    from os import scandir
except ImportError:
    from scandir import scandir

# prestat=True ensures OSErrors are raised in _get, not in the caller consuming the DirEntry
def _get(self, maxitems=0, prestat=True, follow_symlinks=True):
    def customfilter(f):
        if self._exclude is not None and self._exclude.search(f):
            return False
        return self._regex is None or self._regex.search(f)

    count = 0
    for p in self.paths:
        if not os.path.isdir(p): raise PollException("'%s' is not a valid path." % (p,), p)
        if maxitems and count >= maxitems: break
        try:
            # Use scandir over listdir, and since we get DirEntrys, we
            # don't need to explicitly use os.path.join to make full paths
            # and we can use genexpr for validation instead
            for dirent in (de for de in scandir(p) if customfilter(de.name) and self._validate(de.path)):
                # On Windows, stat() is cheap noop (returns precomputed data)
                # except symlink w/follow_symlinks=True (where it stats and caches)
                # On Linux, this will force a stat now, and cache the result
                # so OSErrors will only be raised here, not during sorting
                if prestat:
                    dirent.stat(follow_symlinks=follow_symlinks)

                if maxitems and count >= maxitems: break

                count += 1
                yield dirent
        except OSError:
            '''
            There will be instances where we wont have permission on the file/directory or
            when a file is moved/deleted before it was yielded.
            '''
            continue

def get(self, maxitems=0):
    # Prestat if we have a sortkey (assuming it may use stat data)
    files = self._get(maxitems, prestat=self._sortkey is not None)
    if self._sortkey is not None:
        # self._sortkey must now operate on a os.DirEntry
        # but no more need to wrap in try/except OSError
        files = sorted(files, key=self._sortkey, reverse=self._sortreverse)

    # To preserve observable public behaviors, return path, not DirEntry
    for dirent in files:
        yield dirent.path

This requires a small change in usage; self._sortkey must operate on an os.DirEntry instance, not a file path. So instead of self._sortkey = kwargs.get('sortkey', os.path.getmtime), you might have self._sortkey = kwargs.get('sortkey', lambda de: de.stat().st_mtime).
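
For example, a hypothetical call site (assuming poll accepts the same keyword arguments as in the question) might look like:

# sortkey now receives os.DirEntry objects rather than path strings
p = poll(paths=['/tmp'],
         regex=r"^.*\.CSV",
         sortkey=lambda de: de.stat().st_mtime)
for f in p.get(10):
    print(f)   # get() still yields plain paths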

This approach avoids the complexity of the manual Schwartzian Transform (since access violations can only occur inside _get's try/except, as long as you don't change prestat, no OSErrors occur during key computation). It will also likely run faster, both by lazily iterating the directory instead of constructing a complete list before iterating (admittedly a small benefit unless the directory is huge) and by removing the need for a stat system call at all for most directory entries on Windows.

Upvotes: 1
