user11749375


How to write code to merge punctuations and phrases using spaCy

What I would like to do

I would like to parse sentences and run dependency analysis using spaCy, one of the open-source libraries for natural language processing.

In particular, I would like to know how to write Python code that enables the options to merge punctuation and phrases.

Problem

There are buttons to merge punctuation and phrases on the displaCy Dependency Visualizer web app.

However, I cannot find the way to write these options when it comes to writing code in the local environment.

The current code renders the unmerged version of the parse.

The sample sentence is from Your Dictionary.

Current Code

It is based on the sample code from the official spaCy website.

Please let me know how to modify it to enable the punctuation and phrase merge options.

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

sentence = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."

doc = nlp(sentence)
displacy.render(doc, style="dep")

What I tried to do

I found one example of a merge implementation. However, it didn't work as expected when I applied it to my sentence.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories.")
# Merge the subtree around the token at index 4 into a single token
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Example Code

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
# Merge the subtree around the token at index 4 into a single token
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Upvotes: 3

Views: 2204

Answers (1)

doraTheExplorer

Reputation: 71

If you need to merge noun chunks, check out the built-in merge_noun_chunks pipeline component. When added to your pipeline using nlp.add_pipe, it will take care of merging the spans automatically.

You can just use the code from the displaCy Dependency Visualizer:

import spacy
nlp = spacy.load("en_core_web_sm")

def merge_phrases(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return doc

def merge_punct(doc):
    spans = []
    for word in doc[:-1]:
        if word.is_punct or not word.nbor(1).is_punct:
            continue
        start = word.i
        end = word.i + 1
        while end < len(doc) and doc[end].is_punct:
            end += 1
        span = doc[start:end]
        spans.append((span, word.tag_, word.lemma_, word.ent_type_))
    with doc.retokenize() as retokenizer:
        for span, tag, lemma, ent_type in spans:
            attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
            retokenizer.merge(span, attrs=attrs)
    return doc

text = "On Christmas Eve, we sit in front of the fire and take turns reading Christmas stories."

doc = nlp(text)
# Merge noun phrases into one token.
doc = merge_phrases(doc)
# Attach punctuation to tokens
doc = merge_punct(doc)

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

Upvotes: 3
