McLawrence
McLawrence

Reputation: 5255

`nltk` CoreNLPParser: prevent splitting at hyphens in POS tagger

I am using the nltk CoreNLPParser with the Stanford NLP server for POS tagging as described in this answer.

This tagger treats words with hyphens as multiple words, for example dates like 2007-08 are tagged as CP, :, CP. However, my model uses words with hyphen as one token. Is it possible using the CoreNLPParser to prevent splitting at hyphens?

Upvotes: 1

Views: 590

Answers (1)

alvas
alvas

Reputation: 122132

TL;DR

from nltk.parse.corenlp import GenericCoreNLPParser

class CoreNLPParser(GenericCoreNLPParser):
    _OUTPUT_FORMAT = 'penn'
    parser_annotator = 'parse'

    def make_tree(self, result):
        return Tree.fromstring(result['parse'])

    def tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a list of
        tokens.

        :param sentences: Input sentences to tag
        :type sentences: list(list(str))
        :rtype: list(list(tuple(str, str))
        """
        # Converting list(list(str)) -> list(str)
        sentences = (' '.join(words) for words in sentences)
        if properties == None:
            properties = {'tokenize.whitespace':'true'}
        return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]

    def tag(self, sentence, properties=None):
        """
        Tag a list of tokens.

        :rtype: list(tuple(str, str))

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
        >>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
        >>> parser.tag(tokens)
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
        ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]

        >>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
        >>> tokens = "What is the airspeed of an unladen swallow ?".split()
        >>> parser.tag(tokens)
        [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
        ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
        ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
        """
        return self.tag_sents([sentence], properties)[0]

    def raw_tag_sents(self, sentences, properties=None):
        """
        Tag multiple sentences.

        Takes multiple sentences as a list where each sentence is a string.

        :param sentences: Input sentences to tag
        :type sentences: list(str)
        :rtype: list(list(list(tuple(str, str)))
        """
        default_properties = {'ssplit.isOneSentence': 'true',
                              'annotators': 'tokenize,ssplit,' }

        default_properties.update(properties or {})

        # Supports only 'pos' or 'ner' tags.
        assert self.tagtype in ['pos', 'ner']
        default_properties['annotators'] += self.tagtype
        for sentence in sentences:
            tagged_data = self.api_call(sentence, properties=default_properties)
            yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
                    for tagged_sentence in tagged_data['sentences']]

pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
sent = ['My', 'birthday', 'is', 'on', '09-12-2050']
print(pos_tagger.tag(sent))

[out]:

[('My', 'PRP$'), ('birthday', 'NN'), ('is', 'VBZ'), ('on', 'IN'), ('09-12-2050', 'CD')]

In Long

See

Upvotes: 1

Related Questions