Reputation: 43
I'm trying to use NLTK's ConllCorpusReader for the CoNLL2003 dataset. This dataset contains 4 columns (example):
WORD POS CHUNK NE
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
I created the corpus and it works: I can get words, sents, and tuples with POS tags and chunk tags.
The question is, how can I get the Named Entity tags from my corpus? I know there is the corpus.raw() method, but is there really no way to get them with something like corpus.iob_words()? I found this issue: https://github.com/nltk/nltk/issues/63, but in the latest version of this corpus reader the iob_words method has no additional arguments that I could use to change the list of columns I want to get.
Upvotes: 2
Views: 1434
Reputation: 50220
Looks like you may have to help yourself. Give this a try; I think it is all you need to extend ConllCorpusReader so that iob_words() can be told to select the NE column instead of the (default) CHUNK column. iob_sents(), chunked_words() and chunked_sents() ought to be similarly modified.
from nltk.corpus.reader import ConllCorpusReader
from nltk.tag import map_tag
from nltk.util import LazyConcatenation, LazyMap

class betterConllReader(ConllCorpusReader):

    def iob_words(self, fileids=None, tagset=None, column="chunk"):
        """
        :return: a list of word/tag/IOB tuples
        :rtype: list(tuple)
        :param fileids: the list of fileids that make up this corpus
        :type fileids: None or str or list
        :param column: the column type returned as the third element, e.g. "chunk" or "ne"
        """
        # Check for the requested column rather than always requiring CHUNK.
        self._require(self.WORDS, self.POS, column)
        def get_iob_words(grid):
            return self._get_iob_words(grid, tagset, column)
        return LazyConcatenation(LazyMap(get_iob_words, self._grids(fileids)))

    def _get_iob_words(self, grid, tagset=None, column="chunk"):
        pos_tags = self._get_column(grid, self._colmap['pos'])
        if tagset and tagset != self._tagset:
            pos_tags = [map_tag(self._tagset, tagset, t) for t in pos_tags]
        # Third element comes from the selected column instead of a hardcoded 'chunk'.
        return list(zip(self._get_column(grid, self._colmap['words']), pos_tags,
                        self._get_column(grid, self._colmap[column])))
All I did was replace the hardcoded "chunk" with a keyword argument. With a little more work, multiple columns could be selected (reasonable with iob_*(), less clearly so for the chunked_*() variants).
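To illustrate the multiple-column idea without touching NLTK internals, here is a minimal plain-Python sketch. It is not part of NLTK: the names iter_sentences, COLUMNS and select are hypothetical, the column order is assumed from the sample in the question, and it assumes whitespace-separated fields with blank lines (or -DOCSTART- lines) between sentences.

```python
# Column layout assumed from the question's sample: WORD POS CHUNK NE.
COLUMNS = ("words", "pos", "chunk", "ne")


def iter_sentences(lines, columns=COLUMNS, select=("words", "pos", "ne")):
    """Yield one list of tuples per sentence, keeping only the `select` columns."""
    indices = [columns.index(name) for name in select]
    sentence = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and -DOCSTART- lines act as sentence separators.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        sentence.append(tuple(fields[i] for i in indices))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


sample = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentences = list(iter_sentences(sample.splitlines()))
word_ne = list(iter_sentences(sample.splitlines(), select=("words", "ne")))
```

Here sentences holds word/POS/NE triples, and passing select=("words", "ne") gives word/NE pairs instead. The same "list of column names" argument could be threaded through iob_words() in the subclass above in place of the single column keyword.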
Upvotes: 1