user3383301

Reputation: 1921

spaCy NLP custom rule matcher

I am a beginner with NLP. I am using the spaCy Python library for my NLP project. Here is my requirement:

I have a JSON file with all country names. Now I need to parse the document and get the gold medal count for each country in it. A sample sentence is given below:

"Czech Republic won 5 gold medals at olympics. Slovakia won 0 medals olympics"

I am able to fetch the country names but not their medal counts. My code is given below. Please help me proceed further.

import json
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r"C:\Python36\srclcl\countries.json") as f:
    COUNTRIES = json.load(f)

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp("Czech Republic won 5 gold medals at olympics. Slovakia won 0 medals olympics")
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))

matcher.add("COUNTRY", None, *patterns)


for sent in doc.sents:
    subdoc = nlp(sent.text)
    matches = matcher(subdoc)
    print (sent.text)
    for match_id, start, end in matches:
        print(subdoc[start:end].text)

Also, what if the given text contains another number, such as a year:

"Czech Republic won 5 gold medals at olympics in 1995. Slovakia won 0 medals olympics"

Upvotes: 5

Views: 725

Answers (1)

DBaker

Reputation: 2139

spaCy provides rule-based matching, which you could use here.

The EntityRuler and the Matcher can be combined as follows:

import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable=["ner", "parser"])

countries = ['Czech Republic', 'Slovakia']
# Label every country name as a "country" entity
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "country", "pattern": name} for name in countries])
nlp.add_pipe(ruler)


doc = nlp("Czech Republic won 5 gold medals at olympics. Slovakia won 0 medals olympics")

# Merge each entity (e.g. "Czech Republic") into a single token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])


from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# A country entity, the word "won", then a number
pattern = [{'ENT_TYPE': 'country'}, {'LOWER': 'won'}, {'IS_DIGIT': True}]
matcher.add('medal', None, pattern)
matches = matcher(doc)


for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

output:

Czech Republic won 5
Slovakia won 0

The above code should get you started. Naturally, you will have to write your own, more complex rules to handle cases like "Czech Republic unsurprisingly won 5 gold medals at olympics in 1995." and other more complex sentence structures.
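
As one possible direction for such rules, here is a minimal sketch, not a definitive implementation. It anchors the pattern on the number followed by "medals" rather than on "won", and pairs each count with the nearest country mentioned before it. The COUNTRIES list, the medal_counts helper and the matcher names are hypothetical. It assumes spaCy 2.2.2 or later, where add() accepts a list of patterns, and it treats a plain "0 medals" as a gold medal count, which is also an assumption.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")

# Hypothetical stand-in for the country names loaded from the JSON file.
COUNTRIES = ["Czech Republic", "Slovakia"]

country_matcher = PhraseMatcher(nlp.vocab)
country_matcher.add("COUNTRY", [nlp.make_doc(name) for name in COUNTRIES])

# A number, optionally "gold", then "medal" or "medals". A year such as
# 1995 is not followed by "medals", so it is not picked up.
medal_matcher = Matcher(nlp.vocab)
medal_matcher.add("MEDAL_COUNT", [[
    {"IS_DIGIT": True},
    {"LOWER": "gold", "OP": "?"},
    {"LOWER": {"IN": ["medal", "medals"]}},
]])


def medal_counts(text):
    """Pair each medal count with the closest country mentioned before it."""
    doc = nlp(text)
    countries = sorted(country_matcher(doc), key=lambda m: m[1])
    results = {}
    for _, start, _ in medal_matcher(doc):
        preceding = [doc[s:e].text for _, s, e in countries if e <= start]
        if preceding:
            results[preceding[-1]] = int(doc[start].text)
    return results


text = ("Czech Republic unsurprisingly won 5 gold medals at olympics in 1995. "
        "Slovakia won 0 medals olympics")
print(medal_counts(text))
```

Because the pattern no longer depends on the words between the country and the number, adverbs like "unsurprisingly" do not break it. The nearest-preceding-country pairing is a simple heuristic and will mislabel sentences that mention several countries before a single count.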

Upvotes: 3
