CraigH

Reputation: 2061

How to extract relationship from text in NLTK

Hi, I'm trying to extract relationships from a string of text, based on the second-to-last example here: https://web.archive.org/web/20120907184244/http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html

Given a string such as "Michael James editor of Publishers Weekly", I would like output such as:

[PER: 'Michael James'] ', editor of' [ORG: 'Publishers Weekly']

What is the best way to do this? What input format does extract_rels expect, and how do I format my input to meet that requirement?


I tried to do it myself, but it didn't work. Here is the code I adapted from the book. It doesn't print any results. What am I doing wrong?

class doc():
    pass

doc.headline = ['this is expected by nltk.sem.extract_rels but not used in this script']

def findrelations(text):
    roles = """
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)
    tokenizedsentences = nltk.sent_tokenize(text)
    for sentence in tokenizedsentences:
        taggedwords = nltk.pos_tag(nltk.word_tokenize(sentence))
        doc.text = nltk.batch_ne_chunk(taggedwords)
        print doc.text
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
            print relextract.show_raw_rtuple(rel) # doctest: +ELLIPSIS

text = "Michael James editor of Publishers Weekly"

findrelations(text)

Upvotes: 8

Views: 4073

Answers (1)

Vinicius Woloszyn

Reputation: 314

Here is code based on yours, with just a few adjustments, that works well ;)

import re

import nltk
from nltk.sem import relextract


def findrelations(text):
    # Raw string so that \s reaches the regex engine untouched;
    # re.VERBOSE ignores the layout whitespace and the trailing comment.
    roles = r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)

    # Split into sentences, tokenize, POS-tag, then chunk named entities
    sentences = nltk.sent_tokenize(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences)

    for doc in chunked_sentences:
        print(doc)
        # With corpus='ace', extract_rels reads the chunk tree directly,
        # so no wrapper object with .text and .headline is needed
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ace', pattern=ROLES):
            # It is a tree, so you need to work on it to get the output you want
            # (newer NLTK releases expose this helper as relextract.rtuple)
            print(relextract.show_raw_rtuple(rel))

findrelations('Michael James editor of Publishers Weekly')

Output:

(S (PERSON Michael/NNP) (PERSON James/NNP) editor/NN of/IN (ORGANIZATION Publishers/NNS Weekly/NNP))
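
Notice that the chunker tagged "Michael" and "James" as two separate PERSON entities. To get from a tree like that to the format asked for in the question, one option is to merge neighbouring entities with the same label and then check the words between a PERSON and the next entity against the role pattern. Below is a rough sketch of that idea, not NLTK's own implementation. It uses only the standard library: the `chunked` list is a hypothetical stand-in for the tree above, and `is_entity`, `merge_adjacent` and `find_relations` are names I made up for illustration. One simplification to be aware of: here the filler is plain words ("editor of"), whereas with real tagged input the text between entities may carry the POS tags as well.

```python
import re

# Same role pattern as in the answer; re.VERBOSE ignores the layout whitespace.
ROLES = re.compile(r"""
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """, re.VERBOSE)

# Hypothetical stand-in for one chunked sentence: a named entity is
# (label, [words]); any other token is a plain (word, tag) pair.
chunked = [
    ("PERSON", ["Michael"]),
    ("PERSON", ["James"]),
    ("editor", "NN"),
    ("of", "IN"),
    ("ORGANIZATION", ["Publishers", "Weekly"]),
]


def is_entity(item):
    """Entities carry a list of words; ordinary tokens carry a tag string."""
    return isinstance(item[1], list)


def merge_adjacent(items):
    """Join neighbouring entities that share a label (Michael + James)."""
    merged = []
    for item in items:
        if (is_entity(item) and merged and is_entity(merged[-1])
                and merged[-1][0] == item[0]):
            merged[-1] = (item[0], merged[-1][1] + item[1])
        else:
            merged.append(item)
    return merged


def find_relations(items, subj_label, obj_label, pattern):
    """Return (subject, filler, object) triples whose filler matches pattern."""
    items = merge_adjacent(items)
    results = []
    for i, item in enumerate(items):
        if not (is_entity(item) and item[0] == subj_label):
            continue
        filler = []
        # Collect plain words until the next entity is reached
        for nxt in items[i + 1:]:
            if is_entity(nxt):
                text = " ".join(filler)
                if nxt[0] == obj_label and pattern.match(text):
                    results.append((" ".join(item[1]), text, " ".join(nxt[1])))
                break
            filler.append(nxt[0])
    return results


relations = find_relations(chunked, "PERSON", "ORGANIZATION", ROLES)
for subj, filler, obj in relations:
    print("[PER: %r] %r [ORG: %r]" % (subj, filler, obj))
# prints: [PER: 'Michael James'] 'editor of' [ORG: 'Publishers Weekly']
```

To use this with a real chunk tree you would first have to convert the tree into the list form above (subtrees become entities, leaves become word/tag pairs); that conversion is left out here.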

Upvotes: 4
