Reputation: 3461
I'm using a PlaintextCorpusReader to work with some files from Project Gutenberg. It seems to handle word tokenization without issue, but chokes when I request sentences or paragraphs.
I start by downloading a Gutenberg book (in UTF-8 plaintext) to the current directory. Then:
>>> from nltk.corpus import PlaintextCorpusReader
>>> r = PlaintextCorpusReader('.','Dracula.txt')
>>> r.words()
['DRACULA', 'CHAPTER', 'I', 'JONATHAN', 'HARKER', "'", ...]
>>> r.sents()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    new_filepos = self._stream.tell()
  File "/usr/local/lib/python3.5/dist-packages/nltk/data.py", line 1333, in tell
    assert check1.startswith(check2) or check2.startswith(check1)
AssertionError
I've tried modifying the book in various ways: stripping off the header, removing newlines, adding a period to the end to finish the last "sentence". The error remains. Am I doing something wrong? Or am I running up against some limitation in NLTK?
(Running Python 3.5.0, NLTK 3.2.1, on Ubuntu. Problem appears in other Python 3.x versions as well.)
EDIT: Introspection shows the following locals at the point of exception.
>>> pprint.pprint(inspect.trace()[-1][0].f_locals)
{'buf_size': 63,
'bytes_read': 75,
'check1': "\n\n\n CHAPTER I\n\nJONATHAN HARKER'S JOURNAL\n\n(_Kept i",
'check2': '\n'
'\n'
' CHAPTER I\n'
'\n'
"JONATHAN HARKER'S JOURNAL\n"
'\n'
'(_Kept in shorthand._)',
'est_bytes': 9,
'filepos': 11,
'orig_filepos': 75,
'self': <nltk.data.SeekableUnicodeStreamReader object at 0x7fd2694b90f0>}
In other words, check1 is losing an initial newline somehow.
Upvotes: 3
Views: 851
Reputation: 11
In the reader's encoding argument, use "utf-8-sig" instead of "utf8", which is the default.
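PlaintextCorpusReader's constructor accepts an encoding argument, so the fix looks like `PlaintextCorpusReader('.', 'Dracula.txt', encoding='utf-8-sig')`. The codec's effect can be sketched with plain file I/O (no NLTK needed; `bom_demo.txt` is just a stand-in file created for the demo):

```python
import codecs

# Write a small file that starts with a UTF-8 BOM, as many Gutenberg texts do.
with open("bom_demo.txt", "wb") as f:
    f.write(codecs.BOM_UTF8 + "CHAPTER I\n".encode("utf-8"))

# Decoding as plain utf-8 keeps the BOM as U+FEFF, throwing off byte offsets.
with open("bom_demo.txt", encoding="utf-8") as f:
    print(repr(f.read()[:1]))   # '\ufeff'

# utf-8-sig transparently strips the BOM during decoding.
with open("bom_demo.txt", encoding="utf-8-sig") as f:
    print(repr(f.read()[:1]))   # 'C'
```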
Upvotes: 0
Reputation: 3461
That particular file has a UTF-8 Byte Order Mark (EF BB BF) at the start, which is confusing NLTK. Removing those bytes manually, or copy-pasting the entire text into a new file, fixes the problem.
I'm not sure why NLTK can't handle BOMs, but at least there's a solution.
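For reference, removing the three BOM bytes can be done in place with the standard library instead of a manual edit (sketch only; it creates a small stand-in file rather than assuming the real Dracula.txt is present):

```python
import codecs

path = "Dracula.txt"  # stand-in file for the demo; any BOM-prefixed text works
# Create a sample file that begins with the UTF-8 BOM (EF BB BF).
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"DRACULA\n\nCHAPTER I\n")

# Strip a leading BOM in place so NLTK's byte-offset bookkeeping lines up.
with open(path, "rb") as f:
    data = f.read()
if data.startswith(codecs.BOM_UTF8):
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])

with open(path, "rb") as f:
    print(f.read(3))  # b'DRA' -- the BOM is gone
```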
Upvotes: 4