user13593456

Printing the remainder of a sentence after using spacy matcher to find the start of a target sentence

I want to extract the aim from a scientific journal abstract using spaCy. I have a screenshot of the abstract and have run pytesseract on the image to get the text. I split the text into sentences with:

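import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

#Tokenize into sentences
#(added after doc was created, so it never runs on doc; doc.sents below comes from the model's parser)
sents = nlp.create_pipe("sentencizer")
nlp.add_pipe(sents)

[sent.text for sent in doc.sents]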

This seems to work quite well and gives me a list of sentences. I then wrote a rule-based matcher that I believe matches the part of a sentence that introduces the aim of the study:

#Rule based matching for AIM
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

pattern = [{"POS": "PART"}, {"POS": "VERB"}, {"POS": "DET"}, {"POS": "NOUN"}]
matcher.add('Aim', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]  
    print(span.text)

The matcher prints the target part of the sentence, so I know it works (at least well enough for now; I can improve it later). What I want to do now is run the matcher on each sentence and, if it matches, print that sentence. I tried:

matches = matcher(doc.sents)
if matches:
 print(sent.text)

But it raises: TypeError: Argument 'doc' has incorrect type (expected spacy.tokens.doc.Doc, got generator)
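
The error arises because doc.sents is a generator of sentence spans, while the matcher is called on one document at a time. A minimal sketch of the per-sentence loop follows, assuming spaCy v3 (where nlp.add_pipe takes a component name and matcher.add takes a list of patterns). To stay self-contained it uses a blank pipeline with the rule-based sentencizer and a hypothetical text-based pattern in place of the POS pattern above, since a blank pipeline has no tagger; matching_sentences is an illustrative name, not part of spaCy.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3 API. A blank pipeline plus the sentencizer keeps the
# sketch independent of any downloaded model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
# Hypothetical stand-in pattern: matches the literal tokens "to assess".
matcher.add("Aim", [[{"LOWER": "to"}, {"LOWER": "assess"}]])


def matching_sentences(doc):
    """Return the text of every sentence in which the matcher fires."""
    hits = []
    for sent in doc.sents:
        # Span.as_doc() copies the sentence into a standalone Doc,
        # which is the type the matcher expects.
        if matcher(sent.as_doc()):
            hits.append(sent.text)
    return hits


doc = nlp("Sleep matters. We aimed to assess sleep quality in nurses. Results varied.")
print(matching_sentences(doc))
```

Copying each sentence with as_doc() is simple but does extra work on long texts; matching once on the whole doc and looking up the enclosing sentence avoids the copies.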

Upvotes: 1

Views: 953

Answers (1)

user13593456

For anyone interested, I solved this by changing:

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]  
    print(span.text)

To:

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end]  
    sents = span.sent  #ADDED THIS LINE
    print(sents)
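
One side effect of this approach is that a sentence containing several matches is printed once per match. A hedged sketch of the same span.sent idea with de-duplication follows, assuming spaCy v3 and, to stay self-contained, a blank pipeline with the sentencizer and a hypothetical text-based pattern instead of the POS pattern from the question; aim_sentences is an illustrative helper name.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3 API, blank English pipeline with rule-based sentence splitting.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
# Hypothetical stand-in pattern: "to" followed by "assess" or "evaluate".
matcher.add("Aim", [[{"LOWER": "to"}, {"LOWER": {"IN": ["assess", "evaluate"]}}]])


def aim_sentences(doc):
    """Return each sentence containing at least one match, once, in match order."""
    seen_starts = set()
    sentences = []
    for match_id, start, end in matcher(doc):
        sent = doc[start:end].sent  # sentence that contains the matched span
        if sent.start not in seen_starts:
            seen_starts.add(sent.start)
            sentences.append(sent.text)
    return sentences


doc = nlp("We aimed to assess sleep and to evaluate mood. Results varied.")
print(aim_sentences(doc))
```

Tracking sent.start rather than the sentence text keeps two identical sentences at different positions distinct.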

Upvotes: 3
