NE tags in NLTK ConllCorpusReader

Question

I'm trying to use ConllCorpusReader for the CoNLL-2003 dataset. This dataset contains 4 columns (example):

WORD      POS   CHUNK NE
U.N.      NNP   I-NP  I-ORG
official  NN    I-NP  O
Ekeus     NNP   I-NP  I-PER
heads     VBZ   I-VP  O
for       IN    I-PP  O
Baghdad   NNP   I-NP  I-LOC
.         .     O     O

I created the corpus and it works: I can get words, sents, and tuples with POS tags and chunk tags.

The question is, how can I get the Named Entity tags from my corpus? I know there is the corpus.raw() method, but is there really no way to get them with something like corpus.iob_words()? I found this issue: https://github.com/nltk/nltk/issues/63, but in the latest version of this corpus reader the iob_words method has no additional arguments that I could use to change the list of columns I want to get.

alexis · Accepted Answer

Looks like you may have to help yourself. Give this a try; I think it is all you need to extend ConllCorpusReader so that iob_words() can be told to select the NE column instead of the (default) CHUNK column. iob_sents(), chunked_words() and chunked_sents() ought to be similarly modified.

from nltk.corpus.reader import ConllCorpusReader
from nltk.tag import map_tag
from nltk.util import LazyConcatenation, LazyMap

class betterConllReader(ConllCorpusReader):

    def iob_words(self, fileids=None, tagset=None, column="chunk"):
        """
        :return: a list of word/tag/IOB tuples
        :rtype: list(tuple)
        :param fileids: the list of fileids that make up this corpus
        :type fileids: None or str or list
        :param column: name of the column to return as the third element,
            e.g. "chunk" or "ne"
        :type column: str
        """
        # Check for the column actually requested, not always CHUNK.
        self._require(self.WORDS, self.POS, column)
        def get_iob_words(grid):
            return self._get_iob_words(grid, tagset, column)
        return LazyConcatenation(LazyMap(get_iob_words, self._grids(fileids)))

    def _get_iob_words(self, grid, tagset=None, column="chunk"):
        pos_tags = self._get_column(grid, self._colmap['pos'])
        # Map POS tags to another tagset if one was requested.
        if tagset and tagset != self._tagset:
            pos_tags = [map_tag(self._tagset, tagset, t) for t in pos_tags]
        return list(zip(self._get_column(grid, self._colmap['words']), pos_tags,
                   self._get_column(grid, self._colmap[column])))

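All I did was replace the hardcoded "chunk" with a keyword argument, so corpus.iob_words(column="ne") returns the NE tags, provided the reader was created with a column type named "ne". With a little more work, multiple columns could be selected (reasonable with iob_*(), less clearly so for the chunked_*() variants).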
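To give an idea of what selecting multiple columns might look like, here is a minimal plain-Python sketch that does not use NLTK at all. It assumes a sentence is held as a grid (a list of rows) and that a dict maps column names to indices, much like the reader's _colmap; select_columns and COLMAP are hypothetical names made up for this illustration, not part of NLTK.

```python
# Illustrative sketch only: plain Python, not NLTK code.
# `select_columns` and `COLMAP` are hypothetical names for this example.

SAMPLE = """\
U.N.      NNP   I-NP  I-ORG
official  NN    I-NP  O
Ekeus     NNP   I-NP  I-PER
heads     VBZ   I-VP  O
for       IN    I-PP  O
Baghdad   NNP   I-NP  I-LOC
.         .     O     O
"""

# Column name -> column index, analogous to the reader's _colmap.
COLMAP = {"words": 0, "pos": 1, "chunk": 2, "ne": 3}


def select_columns(grid, colmap, columns=("chunk",)):
    """Return one (word, pos, *selected) tuple per row of the grid."""
    extra = [colmap[name] for name in columns]  # KeyError on an unknown column
    return [
        (row[colmap["words"]], row[colmap["pos"]], *(row[i] for i in extra))
        for row in grid
    ]


# One sentence as a grid: a list of rows, each row a list of column values.
grid = [line.split() for line in SAMPLE.splitlines() if line.strip()]

both = select_columns(grid, COLMAP, columns=("chunk", "ne"))
ne_only = select_columns(grid, COLMAP, columns=("ne",))
print(both[0])     # ('U.N.', 'NNP', 'I-NP', 'I-ORG')
print(ne_only[5])  # ('Baghdad', 'NNP', 'I-LOC')
```

In the reader itself, the same idea would mean accepting a sequence of column names in iob_words() and zipping one _get_column() result per name, but I have not tried that against the real class.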
