Programmer_nltk

Reputation: 915

Pattern-based punctuation using spaCy

As a test, I am using spaCy to punctuate text: I find a pattern with the Matcher and then want to wrap the matched span in quotation marks.

import spacy, en_core_web_sm
from spacy.matcher import Matcher

# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
Punctuation_patterns = [[{'POS': 'NOUN'},{'POS': 'NOUN'},{'POS': 'NOUN'}],
                        ]

# spaCy v2 call signature; spaCy v3 expects matcher.add('PUNCTUATION', Punctuation_patterns)
matcher.add('PUNCTUATION', None, *Punctuation_patterns)
doc = nlp("The cat cat cat sat on the mat. The dog sat on the mat.")
matches = matcher(doc)
spans = []
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
    layer1 = ' '.join(['"{}"'.format(span.text) if token.dep_ == 'ROOT' else '{}'.format(token) for token in doc])
    print(layer1)

Output:

The cat cat cat "cat cat cat" on the mat . The dog "cat cat cat" on the mat .

Expected output

The "cat cat cat" sat on the mat. The dog sat on the mat.

I am only testing with ROOT here. How can I use the spans matched by spaCy to get the desired output?

Edit 1: In the case of multiple matches, such as dog dog dog:

for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
    result = doc.text

for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
    print (result)

Current output:

The "cat cat cat" sat on the mat. The dog dog dog sat on the mat.
The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.

Expected:

  The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.

Upvotes: 0

Views: 368

Answers (1)

Wiktor Stribiżew

Reputation: 627536

You may use

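result = doc.text
for match_id, start, end in matches:
    span = doc[start:end]
    # Wrap the first occurrence of the matched text in double quotes
    result = result.replace(span.text, f'"{span.text}"', 1)
# Print once, after all matches have been processed
print(result)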

That is, you define a variable, result, to hold the output and initialize it with the doc.text value. Then you go through the matches and replace each matched span's text with the same text wrapped in double quotation marks. Printing result once, after the loop, shows only the final string rather than every intermediate step.
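
One caveat with str.replace is that it wraps the first occurrence of the span text, which is not necessarily the place where the match was found. A possible alternative, sketched below, is to insert the quotes by character offset, using the start_char / end_char values the question already collects in its spans list. The helper name quote_spans is hypothetical, the offsets in the example are written out by hand instead of coming from a real Matcher run, and the sketch assumes the matched ranges do not overlap.

```python
def quote_spans(text, spans):
    """Wrap each character range of `text` in double quotes.

    `spans` is a list of {'start': int, 'end': int} dicts, shaped like the
    ones built in the question from span.start_char and span.end_char.
    Assumption: the ranges do not overlap.
    """
    result = text
    # Process the right-most span first so that inserting quote characters
    # never shifts the offsets of spans still waiting to be processed.
    for s in sorted(spans, key=lambda s: s['start'], reverse=True):
        start, end = s['start'], s['end']
        result = result[:start] + '"' + result[start:end] + '"' + result[end:]
    return result


# Hand-written offsets standing in for values taken from matched spans
text = "The cat cat cat sat on the mat. The dog dog dog sat on the mat."
spans = [{'start': 4, 'end': 15}, {'start': 36, 'end': 47}]
print(quote_spans(text, spans))
# The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.
```

With real matches, the spans list from the question can be passed in directly together with doc.text. If the pattern can produce overlapping matches (for example, four nouns in a row), the matches would need to be reduced to non-overlapping ones first.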

Upvotes: 1
