peleitor

Reputation: 469

NLTK - Convert a chunked tree into a list (IOB tagging)

I need to perform Named Entity Recognition / Classification, and generate output in IOB tagged format.

I'm using an NLTK chunker, as delivered by the NLTK-train library, but it produces a Tree, not a list of IOB tags.

import nltk

def chunk_iob(list_of_words):
    # Load the pre-trained POS tagger and chunker
    nltk_tagger = nltk.data.load("taggers/conll2002_aubt.pickle")
    nltk_chunker = nltk.data.load("chunkers/conll2002_NaiveBayes.pickle")

    t = nltk_tagger.tag(list_of_words)  # list of (word, POS) tuples
    print(t)
    c = nltk_chunker.parse(t)  # chunked result, a Tree
    print(c)

This gives c as a Tree, like:

(S
  (LOC Barcelona/NC)
  (PER Juan/NC :/Fd)

...

But I am looking for something like:

Barcelona - LOC
Juan - PER
...

which is the IOB tagged list of the list_of_words parameter, in the same order as list_of_words.

How can I get that tagged list from the tree?

Upvotes: 1

Views: 9871

Answers (2)

Achal Kagwad

Reputation: 311

Yes, @bogs is right.

NLTK's chunkers work with trees rather than lists of tuples. Tuples can be converted to a tree using conlltags2tree(), and a tree can be converted back to tuples using tree2conlltags().

Upvotes: 0

bogs

Reputation: 2296

What you are looking for is tree2conlltags and its reverse conlltags2tree. Here's how it works:

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import conlltags2tree, tree2conlltags


tree = ne_chunk(pos_tag(word_tokenize("New York is my favorite city")))
print(tree)
# (S (GPE New/NNP York/NNP) is/VBZ my/PRP$ favorite/JJ city/NN)

iob_tags = tree2conlltags(tree)
print(iob_tags)
# [('New', 'NNP', 'B-GPE'), ('York', 'NNP', 'I-GPE'), ('is', 'VBZ', 'O'), ('my', 'PRP$', 'O'), ('favorite', 'JJ', 'O'), ('city', 'NN', 'O')]

tree = conlltags2tree(iob_tags)
print(tree)
# (S (GPE New/NNP York/NNP) is/VBZ my/PRP$ favorite/JJ city/NN)

Note that the IOB tags follow this format: B-{tag} for the first token of a chunk, I-{tag} for the tokens inside it, and O for tokens outside any chunk.
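
To get the flat "word - label" list from the question, something along these lines might work. It is only a minimal sketch: the tree is built by hand to stand in for the chunker output (the token "es" and its tag "VSI" are made up for illustration), and tree_to_word_labels is a hypothetical helper, not part of NLTK.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags


def tree_to_word_labels(tree):
    """Hypothetical helper: return (word, label) pairs in sentence order.

    The label is the chunk type without the B-/I- prefix, or 'O'
    for tokens outside any chunk.
    """
    pairs = []
    for word, _pos, iob in tree2conlltags(tree):
        # 'B-LOC' or 'I-LOC' becomes 'LOC'; 'O' stays 'O'
        label = iob.split("-", 1)[1] if iob != "O" else "O"
        pairs.append((word, label))
    return pairs


# Hand-built tree standing in for nltk_chunker.parse(t);
# the ("es", "VSI") token is an assumption added to show an 'O' tag.
tree = Tree("S", [
    Tree("LOC", [("Barcelona", "NC")]),
    ("es", "VSI"),
    Tree("PER", [("Juan", "NC"), (":", "Fd")]),
])

iob = tree2conlltags(tree)       # (word, POS, IOB tag) triples
pairs = tree_to_word_labels(tree)

for word, label in pairs:
    print(f"{word} - {label}")
```

Since tree2conlltags walks the tree left to right, the result should stay in the same order as list_of_words. If you need the B-/I- distinction, use the third element of each triple in iob directly instead of stripping the prefix.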

Upvotes: 15
