I want to extract the aim from a scientific journal abstract using spaCy. I have a screenshot of the abstract and have run pytesseract on the image. I then split the text into sentences with:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # text is the OCR output from pytesseract
# Tokenize into sentences
sents = nlp.create_pipe("sentencizer")
nlp.add_pipe(sents)
[sent.text for sent in doc.sents]
This seems to work quite well and gives me a list of sentences. I then wrote a rule-based matcher that I believe matches the part of a sentence preceding the aim of the study:
from spacy.matcher import Matcher

# Rule-based matching for the aim
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PART"}, {"POS": "VERB"}, {"POS": "DET"}, {"POS": "NOUN"}]
matcher.add('Aim', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # rule name, here 'Aim'
    span = doc[start:end]
    print(span.text)
The matcher prints the target part of the sentence, so I know the matcher works (at least well enough for now; I can improve it later). What I want to do now is run the matcher on each sentence and, if it matches, print the sentence. I tried:
matches = matcher(doc.sents)
if matches:
    print(sent.text)
But it raises: TypeError: Argument 'doc' has incorrect type (expected spacy.tokens.doc.Doc, got generator)
Upvotes: 1
Views: 953
For anyone interested, I solved this by changing:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span.text)
To:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    sents = span.sent  # ADDED THIS LINE
    print(sents)
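One caveat with this approach: if the pattern matches more than once in the same sentence, that sentence is printed more than once. Below is a minimal sketch of one way to collect each matching sentence only once. It assumes spaCy v3, where `matcher.add` takes a list of patterns rather than the `None` callback argument used above. The helper name `matching_sentences` and the hand-built example `Doc` are made up for illustration; with a real pipeline you would pass `nlp(text)` instead.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc


def matching_sentences(doc, matcher):
    """Return the text of each sentence that contains at least one match,
    once per sentence, in the order the matcher reports them."""
    seen_starts = set()
    result = []
    for _match_id, start, end in matcher(doc):
        sent = doc[start:end].sent  # sentence containing the matched span
        if sent.start not in seen_starts:
            seen_starts.add(sent.start)
            result.append(sent.text)
    return result


# Hand-built Doc (illustrative data) so the sketch runs without a model
# download; POS tags and sentence starts are set manually here.
nlp = spacy.blank("en")
words = ["We", "studied", "mice", ".",
         "We", "aimed", "to", "assess", "the", "effect", "."]
pos = ["PRON", "VERB", "NOUN", "PUNCT",
       "PRON", "VERB", "PART", "VERB", "DET", "NOUN", "PUNCT"]
sent_starts = [True, False, False, False,
               True, False, False, False, False, False, False]
doc = Doc(nlp.vocab, words=words, pos=pos, sent_starts=sent_starts)

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PART"}, {"POS": "VERB"}, {"POS": "DET"}, {"POS": "NOUN"}]
matcher.add("Aim", [pattern])  # spaCy v3 signature (assumption: v3 installed)

aim_sentences = matching_sentences(doc, matcher)
print(aim_sentences)
```

As for the original error: `doc.sents` is a generator of `Span` objects, not a `Doc`, which is why `matcher(doc.sents)` fails. In spaCy v3 the matcher should also accept a single `Span`, so looping with `for sent in doc.sents: if matcher(sent): ...` is another option there.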
Upvotes: 3