Superdooperhero

Reputation: 8096

Spacy "is a" mining

I would like to use Spacy matchers to mine "is a" (and other) relationships from Wikipedia in order to build a knowledge database.

I have the following code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
text = u"""Garfield is a large comic strip cat that lives in Ohio. Cape Town is the oldest city in South Africa."""
doc = nlp(text)
sentence_spans = list(doc.sents)
# Write a pattern
pattern = [
    {"POS": "PROPN", "OP": "+"}, 
    {"LEMMA": "be"}, 
    {"POS": "DET"}, 
    {"POS": "ADJ", "OP": "*"}, 
    {"POS": "NOUN", "OP": "+"}
]   

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IS_A_PATTERN", [pattern])
matches = matcher(doc)

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Unfortunately this matches:

Match found: Garfield is a large comic strip
Match found: Garfield is a large comic strip cat
Match found: Town is the oldest city
Match found: Cape Town is the oldest city

whereas I just want:

Match found: Garfield is a large comic strip cat
Match found: Cape Town is the oldest city
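(An aside on the overlapping matches: spaCy ships a filter_spans utility that keeps only the longest of a set of overlapping spans. A minimal, model-free sketch, with the two overlapping spans hard-coded in place of real matcher output:)

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only; no model download needed
doc = nlp("Cape Town is the oldest city in South Africa")

# Simulate overlapping matcher results: doc[0:6] fully contains doc[1:6]
spans = [doc[1:6], doc[0:6]]
longest = filter_spans(spans)  # drops the shorter overlapping span
print([s.text for s in longest])
# → ['Cape Town is the oldest city']
```

In real use, spans would be built from the matcher output with `[doc[start:end] for _, start, end in matcher(doc)]` before filtering.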

In addition I wouldn't mind being able to state that the first part of the match must be the subject of the sentence and the last part the predicate.

I would also like the result returned separated in this manner:

['Garfield', 'is a', 'large comic strip cat', 'comic strip cat']
['Cape Town', 'is the', 'oldest city', 'city']

So that I can get a list of cities.

Is any of this possible in spaCy, or what would the equivalent Python code be?

Upvotes: 2

Views: 231

Answers (3)

Superdooperhero

Reputation: 8096

Managed to do it with this code:

doc = nlp("Cape Town (Afrikaans: Kaapstad, Dutch: Kapstadt) is the oldest city in the south west of South Africa.")
subject_name = attr_full = attr_type = None  # avoid NameError if nothing matches
for chunk in doc.noun_chunks:
    if chunk.root.dep_ == 'nsubj' and chunk.root.head.text == 'is':
        subject_name = chunk.text
    elif chunk.root.dep_ == 'attr' and chunk.root.head.text == 'is':
        attr_full = chunk.text
        attr_type = chunk.root.text
print("{:<25}{:<25}{}".format(subject_name, attr_full, attr_type))

which prints:

Cape Town                the oldest city          city
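The same nsubj/attr check can be exercised token-by-token on a hand-annotated parse, which makes it testable without downloading the large model. A sketch; the heads and dependency labels below are assumptions about what en_core_web_lg would predict for this sentence:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Hand-annotated dependency parse standing in for model output
words = ["Cape", "Town", "is", "the", "oldest", "city"]
heads = [1, 2, 2, 5, 5, 2]  # absolute index of each token's head
deps = ["compound", "nsubj", "ROOT", "det", "amod", "attr"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

subject_name = attr_full = attr_type = None
for tok in doc:
    if tok.dep_ == "nsubj" and tok.head.text == "is":
        subject_name = " ".join(t.text for t in tok.subtree)
    elif tok.dep_ == "attr" and tok.head.text == "is":
        attr_full = " ".join(t.text for t in tok.subtree)
        attr_type = tok.text
print(subject_name, "|", attr_full, "|", attr_type)
# → Cape Town | the oldest city | city
```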

Upvotes: 0

David Dale

Reputation: 11424

I think you need some syntactic analysis here. From a syntactic point of view, your sentences look like this:

                   is                             
    _______________|_____                          
   |      |             cat                       
   |      |    __________|________________         
   |      |   |    |     |     |        lives     
   |      |   |    |     |     |     _____|____    
   |      |   |    |     |     |    |          in 
   |      |   |    |     |     |    |          |   
Garfield  .   a  large comic strip that       Ohio

          is              
  ________|____            
 |   |        city        
 |   |     ____|______     
 |   |    |    |      in  
 |   |    |    |      |    
 |  Town  |    |    Africa
 |   |    |    |      |    
 .  Cape the oldest South 

(I used the method from this question to plot the trees).

Now, instead of extracting substrings, you should extract subtrees. A minimal piece of code to achieve this would first find the "is a" pattern, and then yield the left and right subtrees, if they are attached to the "is a" with the right sort of dependencies:

def get_head(sentence):
    # find "be" tokens immediately followed by a determiner ("is a", "is the")
    toks = [t for t in sentence]
    for i, t in enumerate(toks):
        if t.lemma_ == 'be' and i + 1 < len(toks) and toks[i+1].pos_ == 'DET':
            yield t

def get_relations(text):
    doc = nlp(text)
    for sent in doc.sents:
        for head in get_head(sent):
            children = list(head.children)
            if len(children) < 2:
                continue
            l, r = children[0:2]
            # check that the left child is really a subject and the right one is a description
            if l.dep_ == 'nsubj' and r.dep_ == 'attr':
                yield l, r

for l, r in get_relations(text):
    print(list(l.subtree), list(r.subtree))

It would output something like

[Garfield] [a, large, comic, strip, cat, that, lives, in, Ohio]
[Cape, Town] [the, oldest, city, in, South, Africa]

So you at least separate the left part from the right part correctly. If you want, you can add more filters (e.g. that l.pos_ == 'PROPN'). Another improvement would be to handle cases with more than 2 children of "is" (e.g. adverbs).

Now, you can prune the subtrees as you like, producing even smaller predicates (like "large cat", "comic cat", "strip cat", "cat that lives in Ohio", etc.). A quick-and-dirty version of such pruning could look at only one child at a time:

for l, r in get_relations(text):
    print(list(l.subtree), list(r.subtree))
    for c in r.children:
        words = [r] + list(c.subtree)
        print(' '.join([w.text for w in sorted(words, key=lambda x: x.i)]))

It would produce the following result

[Garfield], [a, large, comic, strip, cat, that, lives, in, Ohio]
a cat
large cat
comic cat
strip cat
cat that lives in Ohio
[Cape, Town], [the, oldest, city, in, South, Africa]
the city
oldest city
city in South Africa

You see that some subtrees are wrong: Cape Town is not the "oldest city" globally. But it seems that you need at least some semantic knowledge to filter out such incorrect subtrees.

Upvotes: 2

ashutosh singh

Reputation: 533

I think this is because of partial matches. The matcher returns every possible match for your pattern, which includes sub-strings too. In the case of "Cape Town is the oldest city", both Cape Town is the oldest city and Town is the oldest city satisfy the pattern.

Either you can filter out the sub-strings, or another method would be to chunk your nouns, replace them with a single tag, and then apply the pattern. For example:

sentence = Cape Town is the oldest city
noun_chunked_sentence = Cape_Town is the oldest_city

After this you can apply the same pattern and it should work.
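This chunk-then-merge idea can be sketched with spaCy's retokenizer, which collapses a span into a single token. A model-free sketch: in real use the spans to merge would come from doc.noun_chunks (which needs a parsed model); here they are hard-coded for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no model download needed
doc = nlp("Cape Town is the oldest city")

# Merge multi-word noun chunks into single tokens before matching.
# Spans are given against the original tokenization; merges apply on exit.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])  # "Cape Town"   -> one token
    retokenizer.merge(doc[4:6])  # "oldest city" -> one token
print([t.text for t in doc])
# → ['Cape Town', 'is', 'the', 'oldest city']
```

After the merge, the whole noun chunk behaves as one token, so an "is a" pattern no longer produces partial matches like "Town is the oldest city".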

Upvotes: -1
