Reputation: 3166
I have begun to experiment with Python and NLTK. I am getting a lengthy error message that I cannot find a solution to, and I would appreciate any insights you may have.
import nltk,csv,numpy
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
tokenData = nltk.word_tokenize(reader)
I'm running Python 2.7 and the latest nltk package on OS X Yosemite. I also tried these two lines of code, with no difference in the result:
with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)
These are the error messages I see:
Traceback (most recent call last):
  File "nltk_text.py", line 11, in <module>
    tokenData = nltk.word_tokenize(reader)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
Thanks in advance
Upvotes: 3
Views: 20932
Reputation:
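The error "expected string or buffer" is raised because word_tokenize expects a string, but it is being given a csv.reader object. Wrapping the reader in str makes the error go away:
tokenData = nltk.word_tokenize(str(reader))
Note, however, that str(reader) is only the text representation of the reader object (something like <_csv.reader object at 0x...>), not the contents of the file, so this tokenizes the wrong text.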
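A quick way to see what str(reader) actually hands to the tokenizer is to print it. The sketch below assumes Python 3 and uses an in-memory file (io.StringIO) with made-up sample text in place of Medium_Edited.csv.

```python
import csv
import io

# In-memory stand-in for the CSV file (sample text is made up for illustration).
csvfile = io.StringIO("Hello world.,Second field\n")
reader = csv.reader(csvfile, delimiter=",", quotechar="|")

# A reader is an iterator over rows, not a string.
is_string = isinstance(reader, str)

# str() returns the object's representation, not the file contents.
as_text = str(reader)

# Iterating is what yields the data: each row is a list of field strings.
rows = list(reader)

print(is_string)
print(as_text)
print(rows)
```

The second printed line is the representation of the reader object rather than anything from the file, so passing it to a tokenizer would split that representation instead of the text you want.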
Upvotes: 0
Reputation: 3099
As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in your csv file, you will have to go through the lines and the fields in those lines:
for line in reader:
    for field in line:
        tokens = word_tokenize(field)
Also, when you import word_tokenize at the beginning of your script, you should call it as word_tokenize, and not as nltk.word_tokenize. This also means you can drop the import nltk statement.
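Putting the loop into a complete script might look like the sketch below. It assumes Python 3. The names tokenize_csv and tokenize are hypothetical, introduced only for this example, and str.split stands in for the tokenizer so the example runs without NLTK's tokenizer data installed. Unlike the loop above, which overwrites tokens on every field, this version collects all tokens in one list.

```python
import csv
import io


def tokenize_csv(csvfile, tokenize=str.split):
    """Return one flat list of tokens from every field of every row.

    `tokenize` is any callable mapping a string to a list of tokens,
    for example nltk's word_tokenize (str.split is only a stand-in).
    """
    reader = csv.reader(csvfile, delimiter=",", quotechar="|")
    tokens = []
    for line in reader:          # each line is a list of field strings
        for field in line:       # each field is a plain string
            tokens.extend(tokenize(field))
    return tokens


# In-memory stand-in for Medium_Edited.csv (sample text is made up).
sample = io.StringIO("first post,hello there\nsecond post,|quoted, with comma|\n")
token_data = tokenize_csv(sample)
print(token_data)
```

With a real file, you would open Medium_Edited.csv in a with block and call tokenize_csv(csvfile, word_tokenize), after importing word_tokenize from nltk.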
Upvotes: 3