Reputation: 59
I am following chapter 6 from the book Natural Language Processing with Python (http://www.nltk.org/book/ch06.html)
I am trying to replicate the experiment of sentence segmentation with the cess_esp corpus. I follow the code line by line and it seems to work until I try to use it to segment a text of my own.
>>> import nltk
>>> from nltk.corpus import cess_esp
>>> sentences = cess_esp.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sentences:
...     tokens.extend(sent)
...     offset += len(sent)
...     boundaries.add(offset-1)
>>> def punct_features(tokens, i):
...     return {'next-word-capitalized': tokens[i+1][0].isupper(),
...             'prev-word': tokens[i-1].lower(),
...             'punct': tokens[i],
...             'prev-word-is-one-char': len(tokens[i-1]) == 1}
>>> featureset = [(punct_features(tokens, i), (i in boundaries))
...               for i in range(1, len(tokens)-1)
...               if tokens[i] in '.?!']
>>> size = int(len(featureset) * 0.1)
>>> train_set, test_set = featureset[size:], featureset[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.9983388704318937
So far so good. But when I try to use the function to segment my text I get an error.
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
new_text = ['En', 'un', 'lugar', 'de', 'la', 'Mancha', ',', 'de', 'cuyo', 'nombre', 'no', 'quiero', 'acordarme', ',', 'no', 'ha', 'mucho', 'tiempo', 'que', 'vivía', 'un', 'hidalgo', 'de', 'los', 'de', 'lanza', 'en', 'astillero', ',', 'adarga', 'antigua', ',', 'rocín', 'flaco', 'y', 'galgo', 'corredor', '.', 'Una', 'olla', 'de', 'algo', 'más', 'vaca', 'que', 'carnero', ',', 'salpicón', 'las', 'más', 'noches', ',', 'duelos', 'y', 'quebrantos', 'los', 'sábados', ',', 'lantejas', 'los', 'viernes', ',', 'algún', 'palomino', 'de', 'añadidura', 'los', 'domingos', ',', 'consumían', 'las', 'tres', 'partes', 'de', 'su', 'hacienda', '.', 'El', 'resto', 'della', 'concluían', 'sayo', 'de', 'velarte', ',', 'calzas', 'de', 'velludo', 'para', 'las', 'fiestas', ',', 'con', 'sus', 'pantuflos', 'de', 'lo', 'mesmo', ',', 'y', 'los', 'días', 'de', 'entresemana', 'se', 'honraba', 'con', 'su', 'vellorí', 'de', 'lo', 'más', 'fino', '.']
segment_sentences(new_text)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
segment_sentences(texto)
File "<pyshell#26>", line 5, in segment_sentences
if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
File "<pyshell#16>", line 2, in punct_features
return {'next-word-capitalized': tokens[i+1][0].isupper(),
IndexError: list index out of range
I have been tweaking some numbers to see if I could fix the index-out-of-range error, but it didn't work.
Any help is appreciated.
Upvotes: 2
Views: 296
Reputation: 1059
It looks like you need to loop over enumerate(words[:-1]) instead of enumerate(words).
As you've written it, you end up calling punct_features(words, i) on the last word in the list. When i is the index of the last word, punct_features() tries to access words[i+1] (as tokens[i+1]). Since the last valid index in words is i itself, words[i+1] doesn't exist and you get an IndexError. Stopping the loop one word early guarantees that the lookahead index is always in range.
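A minimal sketch of the fix, runnable without NLTK (is_boundary below is a hypothetical stand-in for classifier.classify(), which would normally come from the trained Naive Bayes model). The point is only that iterating over words[:-1] keeps the words[i+1] lookahead in range:

```python
def punct_features(tokens, i):
    # Same features as in the question; tokens[i+1] requires i+1
    # to be a valid index, which is why the loop must stop early.
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

def is_boundary(features):
    # Stand-in for classifier.classify(); a real run would use the
    # trained nltk.NaiveBayesClassifier instead of this rule.
    return features['next-word-capitalized']

def segment_sentences(words):
    start = 0
    sents = []
    # words[:-1] stops one short of the end, so punct_features can
    # safely look at words[i+1] for every i visited here.
    for i, word in enumerate(words[:-1]):
        if word in '.?!' and is_boundary(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    return sents

words = ['Hola', 'mundo', '.', 'Esto', 'es', 'una', 'prueba', '.']
print(segment_sentences(words))
# → [['Hola', 'mundo', '.'], ['Esto', 'es', 'una', 'prueba', '.']]
```

A trailing sentence whose final punctuation token is the last element of words is still captured by the if start < len(words) check, since that token is never visited by the shortened loop.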
Upvotes: 2