Draconis

Reputation: 3461

NLTK AssertionError when taking sentences from PlaintextCorpusReader

I'm using a PlaintextCorpusReader to work with some files from Project Gutenberg. It seems to handle word tokenization without issue, but chokes when I request sentences or paragraphs.

I start by downloading a Gutenberg book (in UTF-8 plaintext) to the current directory. Then:

>>> from nltk.corpus import PlaintextCorpusReader
>>> r = PlaintextCorpusReader('.','Dracula.txt')
>>> r.words()
['DRACULA', 'CHAPTER', 'I', 'JONATHAN', 'HARKER', "'", ...]
>>> r.sents()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    new_filepos = self._stream.tell()
  File "/usr/local/lib/python3.5/dist-packages/nltk/data.py", line 1333, in tell
    assert check1.startswith(check2) or check2.startswith(check1)
AssertionError

I've tried modifying the book in various ways: stripping off the header, removing newlines, adding a period to the end to finish the last "sentence". The error remains. Am I doing something wrong? Or am I running up against some limitation in NLTK?

(Running Python 3.5.0, NLTK 3.2.1, on Ubuntu. Problem appears in other Python 3.x versions as well.)

EDIT: Introspection shows the following locals at the point of exception.

>>> pprint.pprint(inspect.trace()[-1][0].f_locals)
{'buf_size': 63,
 'bytes_read': 75,
 'check1': "\n\n\n CHAPTER I\n\nJONATHAN HARKER'S JOURNAL\n\n(_Kept i",
 'check2': '\n'
           '\n'
           ' CHAPTER I\n'
           '\n'
           "JONATHAN HARKER'S JOURNAL\n"
           '\n'
           '(_Kept in shorthand._)',
 'est_bytes': 9,
 'filepos': 11,
 'orig_filepos': 75,
 'self': <nltk.data.SeekableUnicodeStreamReader object at 0x7fd2694b90f0>}

In other words, the two strings are offset by one character: check1 begins with three newlines, while check2 begins with only two.

Upvotes: 3

Views: 851

Answers (2)

Jubaer Hossain

Reputation: 11

Pass encoding="utf-8-sig" to the reader instead of "utf8", which is the default. The "utf-8-sig" codec skips a leading byte order mark when decoding.
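A minimal sketch of what the two codecs do differently, using only the standard library. The sample file and its contents are made up for illustration; the point is that "utf-8" keeps the BOM as a U+FEFF character while "utf-8-sig" drops it.

```python
import codecs
import os
import tempfile

# Hypothetical sample: a short text preceded by a UTF-8 BOM (EF BB BF).
sample = codecs.BOM_UTF8 + "DRACULA\n\nCHAPTER I\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(sample)

    # Plain utf-8 decodes the BOM into a U+FEFF character at position 0.
    with open(path, encoding="utf-8") as f:
        plain = f.read()

    # utf-8-sig consumes the BOM, so the text starts with the real content.
    with open(path, encoding="utf-8-sig") as f:
        stripped = f.read()

print(repr(plain[:8]))
print(repr(stripped[:8]))
```

With NLTK the corresponding call would presumably be PlaintextCorpusReader('.', 'Dracula.txt', encoding='utf-8-sig'); that is an assumption based on the reader's encoding parameter and is not exercised by the sketch above.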

Upvotes: 0

Draconis

Reputation: 3461

That particular file has a UTF-8 Byte Order Mark (EF BB BF) at the start, which is confusing NLTK. Removing those bytes manually, or copy-pasting the entire text into a new file, fixes the problem.

I'm not sure why NLTK can't handle BOMs, but at least there's a solution.
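For anyone who would rather not edit the bytes by hand, here is a rough sketch of removing the BOM with the standard library. The function name strip_utf8_bom is my own invention, and it rewrites the file in place, so work on a copy if the original matters.

```python
import codecs


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched. (Hypothetical helper, not part of NLTK.)
    """
    with open(path, "rb") as f:
        data = f.read()

    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if not data.startswith(codecs.BOM_UTF8):
        return False

    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Calling strip_utf8_bom('Dracula.txt') once before constructing the reader should have the same effect as deleting the three bytes manually.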

Upvotes: 4
