I would like to do parsing and dependency analysis using spaCy, an open-source library for natural language processing.
In particular, I would like to know how to write Python code for the options that merge punctuation and phrases.
The displaCy Dependency Visualizer web app has buttons to merge punctuation and phrases.
However, I cannot find a way to set these options when writing code in my local environment.
My current code, shown below, produces the unmerged version.
The sample sentence is from your dictionary.
It is from the sample code on the official spaCy website.
Please let me know how to change it so that the punctuation and phrase merge options are applied.
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
sentence = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(sentence)
displacy.render(doc, style="dep")
I found one example of a merge implementation. However, it did not work when I applied it to my sentence.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Example Code
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Upvotes: 3
Views: 2204
Reputation: 71
If you need to merge noun chunks, check out the built-in merge_noun_chunks pipeline component. When added to your pipeline using nlp.add_pipe, it will take care of merging the spans automatically.
Alternatively, you can use the code from the displaCy Dependency Visualizer:
import spacy
nlp = spacy.load("en_core_web_sm")
def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc
def merge_punct(doc):
    spans = []
    for word in doc[:-1]:
        if word.is_punct or not word.nbor(1).is_punct:
            continue
        start = word.i
        end = word.i + 1
        while end < len(doc) and doc[end].is_punct:
            end += 1
        span = doc[start:end]
        spans.append((span, word.tag_, word.lemma_, word.ent_type_))
    with doc.retokenize() as retokenizer:
        for span, tag, lemma, ent_type in spans:
            attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
            retokenizer.merge(span, attrs=attrs)
    return doc
text = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."
doc = nlp(text)
# Merge noun phrases into one token.
doc = merge_phrases(doc)
# Attach punctuation to tokens
doc = merge_punct(doc)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Upvotes: 3