Wunter

Reputation: 59

NLTK Sentence boundary error

I am following chapter 6 of the book Natural Language Processing with Python (http://www.nltk.org/book/ch06.html).

I am trying to replicate the sentence segmentation experiment with the cess_esp corpus. I followed the code line by line, and everything works until I try to use it to segment a text of my own.

>>> import nltk
>>> from nltk.corpus import cess_esp
>>> sentences = cess_esp.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sentences:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset-1)


>>> def punct_features(tokens, i):
        return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

>>> featureset = [(punct_features(tokens, i), (i in boundaries))
              for i in range(1, len(tokens)-1)
              if tokens[i] in '.?!']
>>> size = int(len(featureset) * 0.1)
>>> train_set, test_set = featureset[size:], featureset[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.9983388704318937

So far, so good. But when I try to use the function below to segment my own text, I get an error.

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents

new_text = ['En', 'un', 'lugar', 'de', 'la', 'Mancha', ',', 'de', 'cuyo', 'nombre', 'no', 'quiero', 'acordarme', ',', 'no', 'ha', 'mucho', 'tiempo', 'que', 'vivía', 'un', 'hidalgo', 'de', 'los', 'de', 'lanza', 'en', 'astillero', ',', 'adarga', 'antigua', ',', 'rocín', 'flaco', 'y', 'galgo', 'corredor', '.', 'Una', 'olla', 'de', 'algo', 'más', 'vaca', 'que', 'carnero', ',', 'salpicón', 'las', 'más', 'noches', ',', 'duelos', 'y', 'quebrantos', 'los', 'sábados', ',', 'lantejas', 'los', 'viernes', ',', 'algún', 'palomino', 'de', 'añadidura', 'los', 'domingos', ',', 'consumían', 'las', 'tres', 'partes', 'de', 'su', 'hacienda', '.', 'El', 'resto', 'della', 'concluían', 'sayo', 'de', 'velarte', ',', 'calzas', 'de', 'velludo', 'para', 'las', 'fiestas', ',', 'con', 'sus', 'pantuflos', 'de', 'lo', 'mesmo', ',', 'y', 'los', 'días', 'de', 'entresemana', 'se', 'honraba', 'con', 'su', 'vellorí', 'de', 'lo', 'más', 'fino', '.']

segment_sentences(new_text)
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    segment_sentences(new_text)
  File "<pyshell#26>", line 5, in segment_sentences
    if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
  File "<pyshell#16>", line 2, in punct_features
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
IndexError: list index out of range

I have been tweaking the indices to see if I could fix the index out of range error, but nothing has worked.

Any help is appreciated.

Upvotes: 2

Views: 296

Answers (1)

dmh

Reputation: 1059

It looks like you need to loop over enumerate(words[:-1]) instead of enumerate(words).

As you've written it, you call punct_features(words, i) on the last word in the list. When the index of the last word (i) is passed to punct_features(), it tries to access tokens[i+1], i.e. words[i+1]. Since the last valid index in words is i, there is no element at i+1, and you get an IndexError.
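
For reference, here is a minimal sketch of the corrected function, assuming the classifier and punct_features from your question are in scope; the only changes are the words[:-1] slice and dropping the redundant == True comparison:

def segment_sentences(words):
    start = 0
    sents = []
    # Iterate over all but the last word so that punct_features
    # can always look ahead at words[i+1] without falling off the end.
    for i, word in enumerate(words[:-1]):
        if word in '.?!' and classifier.classify(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i + 1
    # Any remaining words (including the final one) form the last sentence.
    if start < len(words):
        sents.append(words[start:])
    return sents

With this change, segment_sentences(new_text) should return three token lists, one per sentence, assuming the classifier labels each of the three periods as a boundary.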

Upvotes: 2
