Reputation: 915
As a test, I am using spaCy to wrap matched text in quotation marks after identifying it as a span.
import spacy, en_core_web_sm
from spacy.matcher import Matcher
# Load the English model
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
Punctuation_patterns = [[{'POS': 'NOUN'},{'POS': 'NOUN'},{'POS': 'NOUN'}],
]
matcher.add('PUNCTUATION', None, *Punctuation_patterns)  # spaCy v2 signature; in v3 use matcher.add('PUNCTUATION', Punctuation_patterns)
doc = nlp("The cat cat cat sat on the mat. The dog sat on the mat.")
matches = matcher(doc)
spans = []
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
layer1 = ' '.join(['"{}"'.format(span.text) if token.dep_ == 'ROOT' else '{}'.format(token) for token in doc])
print(layer1)
Output:
The cat cat cat "cat cat cat" on the mat . The dog "cat cat cat" on the mat .
Expected output:
The "cat cat cat" sat on the mat. The dog sat on the mat.
I am only testing with ROOT here. How can I use the span matches from spaCy to get the desired output?
Edit 1: In case of multiple detections, such as dog dog dog:
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
result = doc.text
for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
    print(result)
Current output:
The "cat cat cat" sat on the mat. The dog dog dog sat on the mat.
The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.
Expected:
The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.
Upvotes: 0
Views: 368
Reputation: 627536
You may use
result = doc.text
for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
print(result)
That is, you define a variable, result, to hold the output and initialize it with the doc.text value. Then you go through the matches and replace the text of each matched span with the same text wrapped in double quotation marks. Print result once, after the loop has finished, so that only the final string is shown.
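One caveat: str.replace with a count of 1 rewrites the first occurrence of the span text, which is not necessarily the matched position if the same text also appears earlier in the document. A possible alternative, sketched below, is to insert the quotes by character offset instead. It assumes a list of dicts shaped like the spans list built in the question (from span.start_char and span.end_char) and non-overlapping ranges; the helper name quote_spans and the hard-coded offsets are made up for illustration, so the sketch runs without spaCy.

```python
def quote_spans(text, spans, quote='"'):
    """Wrap each character range in `spans` with `quote` (hypothetical helper)."""
    # Process ranges from right to left so that inserting quotes
    # does not shift the offsets of ranges still to be handled.
    for span in sorted(spans, key=lambda s: s['start'], reverse=True):
        start, end = span['start'], span['end']
        text = text[:start] + quote + text[start:end] + quote + text[end:]
    return text


text = "The cat cat cat sat on the mat. The dog dog dog sat on the mat."
# Hard-coded stand-ins for span.start_char / span.end_char from the matcher.
spans = [{'start': 4, 'end': 15}, {'start': 36, 'end': 47}]
print(quote_spans(text, spans))
# The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.
```

With real matcher output you would pass doc.text and the spans list directly. If the pattern can produce overlapping matches (for example, four nouns in a row), the ranges would need to be de-duplicated first.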
Upvotes: 1