OAK

Reputation: 3166

NLTK python tokenizing a CSV file

I have begun to experiment with Python and NLTK. I am getting a lengthy error message for which I cannot find a solution, and would appreciate any insights you may have.

import nltk,csv,numpy 
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
tokenData = nltk.word_tokenize(reader)

I'm running Python 2.7 and the latest NLTK package on OS X Yosemite. I also attempted these two lines of code, with no difference in the result:

with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)

These are the error messages I see:

Traceback (most recent call last):
  File "nltk_text.py", line 11, in <module>
    tokenData = nltk.word_tokenize(reader)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Thanks in advance

Upvotes: 3

Views: 20932

Answers (2)

user7821053

Reputation:

It is giving the error "expected string or buffer" because you have forgotten to wrap the reader in str:

tokenData = nltk.word_tokenize(str(reader))

Upvotes: 0

yvespeirsman

Reputation: 3099

As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in your csv file, you will have to go through the lines and the fields in those lines:

for line in reader:         # each line is a list of string fields
    for field in line:      # each field is a plain string
        tokens = word_tokenize(field)

Also, when you import word_tokenize at the beginning of your script, you should call it as word_tokenize, and not as nltk.word_tokenize. This also means you can drop the import nltk statement.
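To make this concrete, here is a minimal, self-contained sketch of the row/field loop. The CSV content and the `|` delimiter are made up for illustration, and `field.split()` stands in for `word_tokenize` so the sketch runs without NLTK's punkt data; with NLTK installed, you would call `word_tokenize(field)` in its place:

```python
import csv
import io

# csv.reader yields one list of string fields per row -- not a single
# string -- which is why passing the reader object itself to
# word_tokenize raises "TypeError: expected string or buffer".
sample = io.StringIO('one sentence here.|another field\nsecond row|more text')
reader = csv.reader(sample, delimiter='|')

all_tokens = []
for row in reader:          # each row is a list of strings
    for field in row:       # each field is a plain string
        # word_tokenize(field) would go here; str.split is a stand-in
        all_tokens.extend(field.split())

print(all_tokens)
```

Accumulating into `all_tokens` also avoids overwriting the tokens of earlier fields on each pass through the loop.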

Upvotes: 3
