Reputation: 8096
I would like to use spaCy's Matcher to mine "is a" (and other) relationships from Wikipedia in order to build a knowledge database.
I have the following code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

text = u"""Garfield is a large comic strip cat that lives in Ohio. Cape Town is the oldest city in South Africa."""
doc = nlp(text)
sentence_spans = list(doc.sents)

# Write a pattern
pattern = [
    {"POS": "PROPN", "OP": "+"},
    {"LEMMA": "be"},
    {"POS": "DET"},
    {"POS": "ADJ", "OP": "*"},
    {"POS": "NOUN", "OP": "+"}
]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IS_A_PATTERN", None, pattern)
matches = matcher(doc)

# Iterate over the matches and print the matched span
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Unfortunately this matches:
Match found: Garfield is a large comic strip
Match found: Garfield is a large comic strip cat
Match found: Town is the oldest city
Match found: Cape Town is the oldest city
whereas I just want:
Match found: Garfield is a large comic strip cat
Match found: Cape Town is the oldest city
In addition, I wouldn't mind being able to state that the first part of the match must be the subject of the sentence and the last part the predicate.
I would also like the result returned split up like this:
['Garfield', 'is a', 'large comic strip cat', 'comic strip cat']
['Cape Town', 'is the', 'oldest city', 'city']
So that I can get a list of cities.
Is any of this possible in spaCy, or what would the equivalent Python code be?
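A minimal sketch of one half of this, keeping only the longest non-overlapping matches via spacy.util.filter_spans (available in spaCy v2.1+), reusing the doc and matches from above:

from spacy.util import filter_spans

# keep only the longest span from each group of overlapping matches
spans = [doc[start:end] for _, start, end in matches]
for span in filter_spans(spans):
    print("Match found:", span.text)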
Upvotes: 2
Views: 231
Reputation: 8096
Managed to do it with this code:
doc = nlp("Cape Town (Afrikaans: Kaapstad, Dutch: Kapstadt) is the oldest city in the south west of South Africa.")
for chunk in doc.noun_chunks:
if chunk.root.dep_ == 'nsubj' and chunk.root.head.text == 'is':
subject_name = chunk.text
elif chunk.root.dep_ == 'attr' and chunk.root.head.text == 'is':
attr_full = chunk.text
attr_type = chunk.root.text
print("{:<25}{:<25}{}".format(subject_name, attr_full, attr_type))
which prints:
Cape Town                the oldest city          city
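To go from there to an actual list of cities, the same idea can be wrapped in a function; a minimal sketch (the extract_is_a helper is my own naming, and it matches the verb by lemma so that "was"/"are" also work):

import spacy

nlp = spacy.load("en_core_web_lg")

def extract_is_a(text):
    # yield (subject, full attribute, attribute head) for copular sentences
    doc = nlp(text)
    for sent in doc.sents:
        subject_name = attr_full = attr_type = None
        for chunk in sent.noun_chunks:
            if chunk.root.dep_ == 'nsubj' and chunk.root.head.lemma_ == 'be':
                subject_name = chunk.text
            elif chunk.root.dep_ == 'attr' and chunk.root.head.lemma_ == 'be':
                attr_full = chunk.text
                attr_type = chunk.root.text
        if subject_name and attr_full:
            yield subject_name, attr_full, attr_type

text = "Cape Town is the oldest city in South Africa. Garfield is a large cat."
cities = [subj for subj, full, head in extract_is_a(text) if head == 'city']
print(cities)  # expected: ['Cape Town']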
Upvotes: 0
Reputation: 11424
I think you need some syntactic analysis here. From a syntactic point of view, your sentences look like this:
             is
   ___________|___________
   |     |              cat
   |     |   ____________|_____________
   |     |   |     |     |     |    lives
   |     |   |     |     |     |    __|___
   |     |   |     |     |     |    |   in
   |     |   |     |     |     |    |    |
Garfield .   a   large comic strip that Ohio

      is
_______|_______
|  |         city
|  |    ______|______
|  |    |    |     in
|  |    |    |      |
| Town  |    |    Africa
|  |    |    |      |
. Cape the oldest South
(I used the method from this question to plot the trees.)
Now, instead of extracting substrings, you should extract subtrees. Minimal code to achieve this would first find the "is a" pattern, and then yield the left and right subtrees, provided they are attached to the "is a" with the right sort of dependencies:
# nlp and text as defined in the question
def get_head(sentence):
    toks = [t for t in sentence]
    for i, t in enumerate(toks):
        # a form of "be" directly followed by a determiner marks an "is a" pattern
        if t.lemma_ == 'be' and i + 1 < len(toks) and toks[i+1].pos_ == 'DET':
            yield t

def get_relations(text):
    doc = nlp(text)
    for sent in doc.sents:
        for head in get_head(sent):
            children = list(head.children)
            if len(children) < 2:
                continue
            l, r = children[0:2]
            # check that the left child is really a subject and the right one is a description
            if l.dep_ == 'nsubj' and r.dep_ == 'attr':
                yield l, r

for l, r in get_relations(text):
    print(list(l.subtree), list(r.subtree))
It would output something like
[Garfield] [a, large, comic, strip, cat, that, lives, in, Ohio]
[Cape, Town] [the, oldest, city, in, South, Africa]
So you at least separate the left part from the right part correctly. If you want, you can add more filters (e.g. that l.pos_ == 'PROPN'). Another improvement would be to handle cases with more than 2 children of "is" (e.g. adverbs), as in the sketch below.
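A sketch of that improvement, selecting the children by dependency label instead of by position (this variant is mine, building on get_head above):

def get_relations_by_dep(text):
    doc = nlp(text)
    for sent in doc.sents:
        for head in get_head(sent):
            # pick the subject and the attribute by dependency label,
            # so extra children (adverbs, punctuation) don't shift positions
            subj = next((c for c in head.children if c.dep_ == 'nsubj'), None)
            attr = next((c for c in head.children if c.dep_ == 'attr'), None)
            # optional extra filter: require a proper-noun subject
            if subj is not None and attr is not None and subj.pos_ == 'PROPN':
                yield subj, attr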
Now, you can prune the subtrees as you like, producing even smaller predicates ("large cat", "comic cat", "strip cat", "cat that lives in Ohio", etc.). A quick-and-dirty version of such pruning could look at only one child at a time:
for l, r in get_relations(text):
    print(list(l.subtree), list(r.subtree))
    for c in r.children:
        # keep the attribute's head word plus a single child's subtree
        words = [r] + list(c.subtree)
        print(' '.join([w.text for w in sorted(words, key=lambda x: x.i)]))
It would produce the following result
[Garfield], [a, large, comic, strip, cat, that, lives, in, Ohio]
a cat
large cat
comic cat
strip cat
cat that lives in Ohio
[Cape, Town], [the, oldest, city, in, South, Africa]
the city
oldest city
city in South Africa
You see that some subtrees are wrong: Cape Town is not the "oldest city" globally. But it seems that you need at least some semantic knowledge to filter out such incorrect subtrees.
Upvotes: 2
Reputation: 533
I think this is because of partial matches. The matcher returns every possible match for your pattern, which includes sub-strings too: both Cape Town is the oldest city and Town is the oldest city satisfy the conditions of your pattern.
You can either filter out the sub-strings, or, as another method, chunk your nouns, replace each chunk with a single token, and then apply the pattern (see the sketch after the example below). For example:
sentence = Cape Town is the oldest city
noun_chunked_sentence = Cape_Town is the oldest_city
After this you can apply the same pattern and it should work.
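A minimal sketch of that chunking idea, using spaCy's retokenizer to merge each noun chunk into a single token (the simplified pattern is my adaptation):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
doc = nlp("Garfield is a large comic strip cat. Cape Town is the oldest city.")

# merge each noun chunk into one token, so the pattern can no longer
# match a partial chunk such as "Town"
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        retokenizer.merge(chunk, attrs={"POS": chunk.root.pos_})

matcher = Matcher(nlp.vocab)
matcher.add("IS_A_PATTERN", None, [
    {"POS": "PROPN"},   # the merged subject chunk
    {"LEMMA": "be"},
    {"POS": "NOUN"},    # the merged attribute chunk (determiner included)
])
for match_id, start, end in matcher(doc):
    print("Match found:", doc[start:end].text)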
Upvotes: -1