Reputation: 3461
I'm using a PlaintextCorpusReader to work with some files from Project Gutenberg. It seems to handle word tokenization without issue, but chokes when I request sentences or paragraphs.
I start by downloading a Gutenberg book (in UTF-8 plaintext) to the current directory. Then:
>>> from nltk.corpus import PlaintextCorpusReader
>>> r = PlaintextCorpusReader('.','Dracula.txt')
>>> r.words()
['DRACULA', 'CHAPTER', 'I', 'JONATHAN', 'HARKER', "'", ...]
>>> r.sents()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    new_filepos = self._stream.tell()
  File "/usr/local/lib/python3.5/dist-packages/nltk/data.py", line 1333, in tell
    assert check1.startswith(check2) or check2.startswith(check1)
AssertionError
I've tried modifying the book in various ways: stripping off the header, removing newlines, adding a period to the end to finish the last "sentence". The error remains. Am I doing something wrong? Or am I running up against some limitation in NLTK?
(Running Python 3.5.0, NLTK 3.2.1, on Ubuntu. Problem appears in other Python 3.x versions as well.)
EDIT: Introspection shows the following locals at the point of exception.
>>> pprint.pprint(inspect.trace()[-1][0].f_locals)
{'buf_size': 63,
'bytes_read': 75,
'check1': "\n\n\n CHAPTER I\n\nJONATHAN HARKER'S JOURNAL\n\n(_Kept i",
'check2': '\n'
'\n'
' CHAPTER I\n'
'\n'
"JONATHAN HARKER'S JOURNAL\n"
'\n'
'(_Kept in shorthand._)',
'est_bytes': 9,
'filepos': 11,
'orig_filepos': 75,
'self': <nltk.data.SeekableUnicodeStreamReader object at 0x7fd2694b90f0>}
In other words, check1 is losing an initial newline somehow.
Upvotes: 3
Views: 851
Reputation: 11
In the reader's encoding argument, use "utf-8-sig" instead of "utf8", which is the default.
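PlaintextCorpusReader's constructor accepts an encoding argument, so the fix looks like `PlaintextCorpusReader('.', 'Dracula.txt', encoding='utf-8-sig')`. The codec's effect can be sketched with plain file I/O (no NLTK needed; `bom_demo.txt` is just a stand-in file created for the demo):

```python
import codecs

# Write a small file that starts with a UTF-8 BOM, as many Gutenberg texts do.
with open("bom_demo.txt", "wb") as f:
    f.write(codecs.BOM_UTF8 + "CHAPTER I\n".encode("utf-8"))

# Decoding as plain utf-8 keeps the BOM as U+FEFF, throwing off byte offsets.
with open("bom_demo.txt", encoding="utf-8") as f:
    print(repr(f.read()[:1]))   # '\ufeff'

# utf-8-sig transparently strips the BOM during decoding.
with open("bom_demo.txt", encoding="utf-8-sig") as f:
    print(repr(f.read()[:1]))   # 'C'
```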
Upvotes: 0
Reputation: 3461
That particular file has a UTF-8 Byte Order Mark (EF BB BF) at the start, which is confusing NLTK. Removing those bytes manually, or copy-pasting the entire text into a new file, fixes the problem.
I'm not sure why NLTK can't handle BOMs, but at least there's a solution.
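For reference, removing the three BOM bytes can be done in place with the standard library instead of a manual edit (sketch only; it creates a small stand-in file rather than assuming the real Dracula.txt is present):

```python
import codecs

path = "Dracula.txt"  # stand-in file for the demo; any BOM-prefixed text works
# Create a sample file that begins with the UTF-8 BOM (EF BB BF).
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"DRACULA\n\nCHAPTER I\n")

# Strip a leading BOM in place so NLTK's byte-offset bookkeeping lines up.
with open(path, "rb") as f:
    data = f.read()
if data.startswith(codecs.BOM_UTF8):
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])

with open(path, "rb") as f:
    print(f.read(3))  # b'DRA' -- the BOM is gone
```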
Upvotes: 4