Reputation: 23
I am using spaCy's Language.pipe method to process texts as a stream and yield Doc objects in order (https://spacy.io/api/language#pipe).
This method is faster than processing files one by one and takes a generator object as input.
If the system hits a "bad file" I want to be sure I can identify it. However, I am not sure how to achieve this with Python generators. What is the best approach to make sure I capture the error? I don't currently have a file that causes an error, but will likely find one in production.
I am using spaCy version 2.1 and Python 3.6.3.
import os
import spacy

nlp = spacy.load('en')

def generator():
    path = "C:/Temp/tmp/"  # place any text files here for testing
    try:
        for root, _, files in os.walk(path, topdown=False):
            for name in files:
                with open(os.path.join(root, name), 'r', encoding='utf-8', errors='ignore') as inputFileStream:
                    docText = inputFileStream.read()
                yield (docText, name)
    except Exception as e:
        print('Error opening document. Doc name: {}'.format(os.path.join(root, name)), str(e))

def processfiles():
    try:
        for doc, file in nlp.pipe(generator(), as_tuples=True, batch_size=1000):
            print(file)
    except Exception as e:
        print('Error processing file: {}'.format(file), str(e))

if __name__ == '__main__':
    processfiles()
Edit - I have attempted to explain my problem better.
The specific thing I need is to identify exactly which file causes spaCy a problem; in particular, I want to know exactly which file fails during this statement: for doc, file in nlp.pipe(generator(), as_tuples=True, batch_size=1000):
My assumption is that it is possible to run into a file that causes spaCy an issue during the pipe call itself (for example, during the tagger or parser pipeline components).
Originally I was processing the text with spaCy file by file, so if spaCy had a problem I knew exactly which file caused it. With a generator this seems harder. I am confident that errors that occur in the generator method itself can be captured, especially taking on board the comments by John Rutledge.
Perhaps a better way to ask the question is: how do I handle exceptions when generators are passed to methods like this? My understanding is that the pipe method will process the generator as a stream.
Upvotes: 2
Views: 885
Reputation: 672
It looks like your main problem is that your try/except statement currently halts execution on the first error it encounters. To continue yielding files when an error is encountered, you need to place your try/except further down in the for-loop, i.e. so that it wraps only the with open context manager.
Note also that a blanket try/except is considered an anti-pattern in Python, so typically you will want to catch and handle errors explicitly instead of catching the general-purpose Exception. I included the more explicit OSError and IOError as examples.
Lastly, because you can catch the errors in the generator itself, the nlp.pipe function no longer needs the as_tuples param.
from pathlib import Path
import spacy

def grab_files(path):
    for file_path in Path(path).rglob('*'):
        if file_path.is_file():
            try:
                with open(str(file_path), 'r', encoding='utf-8', errors='ignore') as f:
                    yield f.read()
            except (OSError, IOError) as err:
                print(f'ERROR: {file_path}', err)

nlp = spacy.load('en')

for doc in nlp.pipe(grab_files('C:/Temp/tmp/'), batch_size=1000):
    print(doc)  # ... do something with the spaCy Doc here
*Edit - to answer the follow-up question.*
Note that you are still reading the contents of the text documents one at a time, just as you would without a generator; however, doing so via a generator returns an object that defers the execution until you pass it into the nlp.pipe method. spaCy then processes one batch of text documents at a time via its internal util.minibatch function. That function ends in yield list(batch), which executes the code that opens/closes the files (1000 at a time in your case). So as regards any non-spaCy-related errors, i.e. errors associated with the opening/reading of the files, the code I posted should work as is.
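To illustrate the deferred execution, here is a minimal sketch (the names are illustrative, not from spaCy):

def lazy_reader():
    for i in range(3):
        print('reading item', i)  # runs only when the item is consumed
        yield i

gen = lazy_reader()  # nothing printed yet; execution is deferred
first = next(gen)    # prints 'reading item 0' - work happens on demand
rest = list(gen)     # drains the remaining items, much like yield list(batch) does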
However, as it stands, both your os.walk and my Path(path).rglob are indiscriminately picking up any file in the directory regardless of its filetype. So, for example, if there were a .png file in your /tmp folder then spaCy would raise a TypeError during the tokenization process. If you want to capture those kinds of errors then your best bet is to anticipate and avoid them before sending them to spaCy, e.g. by amending your code with a whitelist that only allows certain file extensions (.rglob('*.txt')), as in the sketch below.
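For example, a whitelisted version of grab_files might look like this (the pattern parameter is my own illustrative addition, not part of the original answer):

from pathlib import Path

def grab_files(path, pattern='*.txt'):
    # Whitelist via the glob pattern so only plain-text files reach spaCy
    for file_path in Path(path).rglob(pattern):
        if file_path.is_file():
            try:
                with open(str(file_path), 'r', encoding='utf-8', errors='ignore') as f:
                    yield f.read()
            except (OSError, IOError) as err:
                print(f'ERROR: {file_path}', err)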
If you are working on a project that, for one reason or another, cannot afford to be interrupted by an error no matter the cost, and you absolutely need to know at which stage of the pipeline the error occurred, then one approach might be to wrap each default spaCy pipeline component you intend to use (Tagger, DependencyParser, etc.) in a custom component that adds blanket error handling/logging, and then process your files using that completely custom pipeline. But unless there is a gun pointed at your head, I would not recommend it; it is much better to anticipate the errors you expect to occur and handle them inside your generator. Perhaps someone with better knowledge of spaCy's internals will have a better suggestion, though.
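A rough sketch of that wrapper idea, assuming spaCy 2.x, where pipeline components are callables that take and return a Doc and can be swapped out with nlp.replace_pipe. Note that a plain wrapper function has no batched .pipe method, so nlp.pipe will fall back to running the wrapped component one doc at a time, which costs speed:

import spacy

nlp = spacy.load('en')

def wrap_component(name, component):
    # Log the component name and a snippet of the offending text
    # instead of letting one bad doc halt the whole stream.
    def wrapped(doc):
        try:
            return component(doc)
        except Exception as err:
            print(f'ERROR in {name}: {doc.text[:50]!r}', err)
            return doc  # pass the doc through unannotated
    return wrapped

for name, component in list(nlp.pipeline):
    nlp.replace_pipe(name, wrap_component(name, component))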
Upvotes: 1