HeadInTheClouds

Reputation: 23

How do I handle exceptions with Python generators when using spaCy?

I am using spaCy's language.pipe method to process texts as a stream; it yields Doc objects in order (https://spacy.io/api/language#pipe).

This method is faster than processing files one by one and takes a generator object as input.

If the system hits a "bad file" I want to be sure I can identify it. However, I am not sure how to achieve this with Python generators. What is the best approach to make sure I capture the error? I don't currently have a file that causes an error, but I will likely encounter one in production.

I am using spaCy version 2.1 and Python 3.6.3.

import os
import spacy

nlp = spacy.load('en')

def generator():
    path = "C:/Temp/tmp/"  # place any text files here for testing

    try:
        for root, _, files in os.walk(path, topdown=False):
            for name in files:
                with open(os.path.join(root, name), 'r', encoding='utf-8', errors='ignore') as inputFileStream:
                    docText = inputFileStream.read()
                yield (docText, name)

    except Exception as e:
        print('Error opening document. Doc name: {}'.format(os.path.join(root, name)), str(e))

def processfiles():
    try:
        for doc, file in nlp.pipe(generator(), as_tuples=True, batch_size=1000):
            print(file)

    except Exception as e:
        print('Error processing file: {}'.format(file), str(e))

if __name__ == '__main__':
    processfiles()

Edit - I have attempted to explain my problem better.

The specific thing I need is to identify exactly which file caused a problem for spaCy; in particular, I want to know exactly which file fails during this statement: for doc, file in nlp.pipe(generator(), as_tuples=True, batch_size=1000)

My assumption is that it is possible to run into a file that causes spaCy an issue during the pipe call (for example, during the tagger or parser pipeline components).

Originally I was feeding the text to spaCy file by file, so when spaCy had a problem I knew exactly which file caused it. Using a generator, this seems harder. I am confident that errors occurring in the generator method itself can be captured, especially taking on board the comments by John Rutledge.

Perhaps a better way to ask the question is: how do I handle exceptions when generators are passed to methods like this? My understanding is that the pipe method processes the generator as a stream.
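For illustration, even driving the stream manually with next() (a sketch; the error handling here is deliberately naive) only tells me that something in the current batch failed, not which file:

stream = nlp.pipe(generator(), as_tuples=True, batch_size=1000)
while True:
    try:
        doc, file = next(stream)
    except StopIteration:
        break  # stream exhausted normally
    except Exception as e:
        # With batch_size=1000 this only reveals that some document in the
        # current batch failed, not which one, and the stream state is now
        # unreliable.
        print('Error during pipeline processing:', e)
        break
    print(file)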

Upvotes: 2

Views: 885

Answers (1)

John Rutledge

Reputation: 672

It looks like your main problem is that your try/except statement currently halts execution on the first error it encounters. To keep yielding files after an error, you need to move the try/except further down, inside the for-loop; for example, you can wrap the with open context manager.

Note also that a blanket try/except is considered an anti-pattern in Python, so you will typically want to catch and handle errors explicitly rather than catching the general-purpose Exception. I included the more explicit OSError and IOError as examples (in Python 3, IOError is an alias of OSError).

Lastly, because you can catch the errors in the generator itself, the nlp.pipe call no longer needs the as_tuples param.

from pathlib import Path
import spacy


def grab_files(path):
    for file_path in Path(path).rglob('*'):  # recurse through everything under path
        if file_path.is_file():
            try:
                with open(str(file_path), 'r', encoding='utf-8', errors='ignore') as f:
                    yield f.read()
            except (OSError, IOError) as err:
                # file could not be opened/read; log it and keep going
                print(f'ERROR: {file_path}', err)


nlp = spacy.load('en')
for doc in nlp.pipe(grab_files('C:/Temp/tmp/'), batch_size=1000):
    print(doc)  # ... do something with spacy Doc here

Edit - to answer the follow-up question.

Note that you are still reading the contents of the text documents one at a time, just as you would without a generator; doing it via a generator simply returns an object that defers execution until you pass it to the nlp.pipe method. spaCy then processes one batch of text documents at a time via its internal util.minibatch function. That function ends in yield list(batch), which is what actually runs the code that opens/closes the files (1000 at a time in your case). So as far as any non-spaCy-related errors go, i.e. errors raised while opening/reading a file, the code I posted should work as is.
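As a quick illustration of that deferred execution, here is a toy sketch using spaCy's util.minibatch directly (noisy_gen is a made-up stand-in for your file-reading generator):

from spacy import util

def noisy_gen():
    for i in range(5):
        print(f'opening file {i}')  # stands in for the open()/read() work
        yield f'text {i}'

batches = util.minibatch(noisy_gen(), size=2)
# Nothing has printed yet - the generator body has not run.
first = next(batches)
# Only now do 'opening file 0' and 'opening file 1' print.
print(first)  # ['text 0', 'text 1']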

However, as it stands, both your os.walk and my Path(path).rglob pick up any file in the directory indiscriminately, regardless of its filetype. If, for example, there were a .png file in your /tmp folder, spaCy would raise a TypeError during tokenization. If you want to capture those kinds of errors, your best bet is to anticipate and avoid them before sending anything to spaCy, e.g. by amending your code with a whitelist that only allows certain file extensions (.rglob('*.txt')), as in the sketch below.
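For example, a variant of the generator above restricted to a whitelist of extensions (the ALLOWED set here is just an illustration - adjust it to your corpus):

from pathlib import Path

ALLOWED = {'.txt'}  # hypothetical whitelist of extensions

def grab_text_files(path):
    for file_path in Path(path).rglob('*'):
        # Skip directories and any file type not on the whitelist
        if not (file_path.is_file() and file_path.suffix.lower() in ALLOWED):
            continue
        try:
            with open(str(file_path), 'r', encoding='utf-8', errors='ignore') as f:
                yield f.read()
        except (OSError, IOError) as err:
            print(f'ERROR: {file_path}', err)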

If you are working on a project that cannot afford to be interrupted by an error, no matter the cost, and you absolutely need to know at which stage of the pipeline the error occurred, then one approach might be to create a custom pipeline component for each default spaCy pipeline component (Tagger, DependencyParser, etc.) you intend to use, wrap those components in your blanket error handling/logging logic, and then process your files with the fully custom pipeline. But unless there is a gun pointed at your head, I would not recommend it; it is much better to anticipate the errors you expect to occur and handle them inside your generator. Perhaps someone with better knowledge of spaCy's internals will have a better suggestion, though.
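If you did go down that road, the rough shape might look like the sketch below (wrap_component is a made-up helper, and returning the unprocessed Doc after a failure may confuse downstream components). Because a plain function has no .pipe method, spaCy falls back to calling it one Doc at a time inside nlp.pipe, which is what lets you see exactly which document failed:

import spacy

nlp = spacy.load('en')

def wrap_component(name, component):
    # Wrap a pipeline component so errors are logged instead of raised.
    def wrapped(doc):
        try:
            return component(doc)
        except Exception as err:
            print(f'ERROR in {name!r} on doc starting {doc.text[:30]!r}:', err)
            return doc  # hand the doc on unprocessed - downstream beware
    return wrapped

# Swap each default component (tagger, parser, ner) for its wrapped version
for name, component in list(nlp.pipeline):
    nlp.replace_pipe(name, wrap_component(name, component))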

Upvotes: 1
