Cindy Rico

Reputation: 31

First and Last name tagged as one token by using Stanford NER with Python

I'm using the Stanford Named Entity Recognizer with Python to find the proper names in the novel "One Hundred Years of Solitude". Many of them consist of a first and a last name, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad". Because of the tokenizer I am using, these names are always split into separate tokens, e.g. "Aureliano" and "Buendía". I would like to keep them together as one token, so that Stanford NER tags the full name as "PERSON".

The code I wrote:

import nltk
from nltk.tag import StanfordNERTagger
from nltk import word_tokenize
from nltk import FreqDist

sentence1 = open('book1.txt').read()
sentence = sentence1.split()

# Raw strings keep the backslashes in the Windows paths literal.
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"

st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

def findtags(tagged_text, tag_prefix):
    # Count how often each word occurs under every tag ending in tag_prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags(taggedSentence, 'PERSON'))

The result looks like this:

{'PERSON': [('Aureliano', 397), ('José', 294), ('Arcadio', 286), ('Buendía', 251), ...

Does anybody have a solution? I would be more than grateful.

Upvotes: 3

Views: 952

Answers (1)

Surya Pratap

Reputation: 1

from itertools import groupby

from nltk.tag import StanfordNERTagger

filename = 'book1.txt'
sentence1 = open(filename).read()
sentence = sentence1.split()

# Raw strings keep the backslashes in the Windows paths literal.
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"

st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

# Join each run of consecutive PERSON tokens into one full name.
test = []
for tag, group in groupby(taggedSentence, key=lambda pair: pair[1]):
    if tag == 'PERSON':
        test.append(' '.join(word for word, _ in group))

# Key the result by the file name without its extension.
test_dict = {filename.split('.')[0]: tuple(test)}

print(test_dict)
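
Grouping adjacent PERSON tokens can also be tried without running the tagger at all. The following is only a minimal sketch: `merge_person_tokens` is a hypothetical helper name, and the `sample` list is hand-written in the (word, tag) shape that `st.tag()` returns, not real tagger output. It merges each run of neighbouring PERSON tokens into one name and then counts the full names, similar to the frequency list in the question.

```python
from collections import Counter
from itertools import groupby


def merge_person_tokens(tagged_tokens):
    """Join each run of adjacent PERSON-tagged tokens into one name."""
    names = []
    # groupby yields one group per run of consecutive pairs with the same tag.
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == 'PERSON':
            names.append(' '.join(word for word, _ in run))
    return names


# Hand-written sample (an assumption for illustration, not tagger output).
sample = [
    ('Aureliano', 'PERSON'), ('Buendía', 'PERSON'), ('met', 'O'),
    ('Úrsula', 'PERSON'), ('and', 'O'),
    ('Aureliano', 'PERSON'), ('Buendía', 'PERSON'), ('.', 'O'),
]

names = merge_person_tokens(sample)
name_counts = Counter(names).most_common()
print(name_counts)
# → [('Aureliano Buendía', 2), ('Úrsula', 1)]
```

Two caveats with this approach: two different people mentioned directly one after the other would be merged into a single name, and a name like "Santa Sofía de la Piedad" only stays whole if the tagger also labels "de" and "la" as PERSON.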

Upvotes: 0
