How to remove ORG names and GPE from noun chunk in spacy

Question

I have the following code:

import spacy
from spacy.tokens import Span
import en_core_web_lg
nlpsm = en_core_web_lg.load()

text = 'IT manager at Google'
doc = nlpsm(text)

finalwor = []
fil = [i for i in doc.ents if i.label_.lower() in ["person"]]
fil_a = [i for i in doc.ents if i.label_.lower() in ['GPE']]
fil_b = [i for i in doc.ents if i.label_.lower() in ['ORG']]
for chunk in doc.noun_chunks:
    if chunk not in fil and chunk not in fil_a and chunk not in fil_b:
        finalwor=list(doc.noun_chunks)
        print("finalwor after noun_chunk", finalwor)
    else:
        chunk in fil_a and chunk in fil_b
        entword=list(str(chunk.text).replace(str(chunk.text),""))
        finalwor.extend(entword)

I am not sure what I am doing wrong here. If the text is 'IT manager at Google',

my current output is "IT manager, Google".

The output I want is "IT manager".

Basically, I want the company names and GPE names to be replaced by an empty string, or simply deleted.

dee · Accepted Answer

I think the problem is this line: finalwor=list(doc.noun_chunks). Inside the loop it assigns every noun chunk in the doc to finalwor, instead of appending only the chunk that passed your condition.
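
There also seems to be a second issue in the question's filters: the label is lower-cased and then compared against upper-case strings ('GPE', 'ORG'), so fil_a and fil_b would always come out empty. A tiny sketch of that comparison, using a plain string as a stand-in for an entity's label_ (no spaCy needed to see the effect):

```python
# spaCy reports entity labels in upper case, e.g. "ORG", "GPE", "PERSON".
# Here a plain string stands in for ent.label_.
label = "GPE"

# The question's check: a lower-cased label can never equal an upper-case string.
buggy = label.lower() in ["GPE"]

# Comparing the label as-is against upper-case names works.
fixed = label in {"GPE", "ORG"}

print(buggy, fixed)  # False True
```

Either compare against lower-case strings after calling lower(), or drop lower() and compare against the upper-case label names directly.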

You might be looking for something like this:

import spacy
from spacy.tokens import Span
import en_core_web_lg
nlpsm = en_core_web_lg.load()

doc = nlpsm('Maria, IT manager at Google and gardener')

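finalwor = []
# Entity spans to exclude, one list per label (label_ is lower-cased before comparing)
fil = [i for i in doc.ents if i.label_.lower() in ["person"]]
fil_a = [i for i in doc.ents if i.label_.lower() in ['gpe']]
fil_b = [i for i in doc.ents if i.label_.lower() in ['org']]

# Keep only the noun chunks that are not one of the excluded entity spans
for chunk in doc.noun_chunks:
    if chunk not in fil and chunk not in fil_a and chunk not in fil_b:
        finalwor.append(chunk)

# Print once, after the loop has finished
print("finalwor after noun_chunk", finalwor)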

finalwor after noun_chunk [IT manager, gardener]
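
Note that the check above relies on a noun chunk Span comparing equal to an entity Span. Whether that holds may depend on your spaCy version (some versions also seem to compare the span label), and in any case it only drops chunks that coincide exactly with an entity. A minimal sketch of an alternative, assuming you are happy to filter token by token: keep only the tokens whose ent_type_ is not one of the unwanted labels. The helper name strip_entities and the Tok stand-in are hypothetical, made up for this illustration; the function only needs objects that iterate over tokens with .text and .ent_type_, which spaCy Span and Token objects provide.

```python
from collections import namedtuple


def strip_entities(chunks, labels=("ORG", "GPE")):
    """Return chunk texts with tokens of the given entity types removed.

    `chunks` is any iterable of token sequences; each token needs
    `.text` and `.ent_type_` (as spaCy tokens have).
    """
    cleaned = []
    for chunk in chunks:
        # Keep tokens that are not part of an unwanted entity
        kept = [tok.text for tok in chunk if tok.ent_type_ not in labels]
        # Skip chunks that consisted only of entity tokens
        if kept:
            cleaned.append(" ".join(kept))
    return cleaned


# Hand-labelled stand-in for the noun chunks of 'IT manager at Google'.
# The entity tags here are assumed for illustration, not model output.
Tok = namedtuple("Tok", ["text", "ent_type_"])
chunks = [
    [Tok("IT", ""), Tok("manager", "")],
    [Tok("Google", "ORG")],
]

print(strip_entities(chunks))  # ['IT manager']
```

With a real pipeline you would call strip_entities(doc.noun_chunks). What you get back then depends on which entities the model actually tags. Joining with a single space is a simplification that does not preserve the original whitespace.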
