Reputation: 21
I'm currently working on a sentiment analysis project using NLTK in Python. I can't get my script to tokenize rows of text from my CSV. If I pass the text in one entry at a time it works fine, but when I pass the whole column in I get a persistent error: 'TypeError: expected string or bytes-like object'. Here are the printed DataFrame and the Python code I'm using. Any help resolving this issue would be great.
abstract
0 Allergic diseases are often triggered by envir...
1 omal lymphopoietin (TSLP) has important roles ...
2 of atrial premature beats, and a TSLP was high...
3 deposition may play an important role in the ...
4 ted by TsPLP was higher than that mediated by ...
5 nal Stat5 transcription factor in that TSLP st...
import pandas as pd
import nltk

data = pd.read_csv('text.csv', sep=';', encoding='utf-8')
x = data.loc[:, 'abstract']
print(x.head())
tokens = nltk.word_tokenize(x)  # this line raises the TypeError
print(tokens)
Attached is the full stack trace. EDIT: added the print statement and its output above.
Upvotes: 2
Views: 7271
Reputation: 3503
The nltk documentation gives an example of nltk.word_tokenize usage, where you may notice that "sentence" is a string. In your situation, x is a DataFrame Series (of strings), which you need to reconstruct into a single string before passing it to nltk.word_tokenize. One way to deal with this is to build your nltk "sentence" from x:
x = data.loc[:, 'abstract']
sentence=' '.join(x)
tokens = nltk.word_tokenize(sentence)
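To see why the join step matters, here is a minimal runnable sketch. The two-row DataFrame is hypothetical stand-in data for the 'abstract' column, and str.split is used in place of nltk.word_tokenize so the sketch runs without NLTK's punkt data; the Series-versus-string distinction is the same either way.

```python
import pandas as pd

# Hypothetical stand-in for the 'abstract' column of text.csv.
data = pd.DataFrame({'abstract': ['TSLP has important roles.',
                                  'Allergic diseases are common.']})

x = data.loc[:, 'abstract']   # a Series of strings, not a single string
sentence = ' '.join(x)        # collapse the Series into one string
# nltk.word_tokenize(sentence) would now accept this; str.split is a
# stand-in tokenizer here so the sketch needs only pandas.
tokens = sentence.split()
print(tokens)
```

Passing `x` directly fails because the tokenizer expects a string, while joining first produces one flat token list for the whole column.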
EDIT: As per the further comments, try this instead (remember the result will be a Series of token lists, to be accessed accordingly):
tokens = x.apply(lambda sentence: nltk.word_tokenize(sentence))
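The apply-based variant can be sketched the same way. Again the Series contents are hypothetical and str.split stands in for nltk.word_tokenize so the example runs on its own; the point is that each row is tokenized separately, giving a Series whose elements are token lists.

```python
import pandas as pd

# Hypothetical two-row stand-in for the 'abstract' column.
x = pd.Series(['TSLP has important roles.',
               'Allergic diseases are common.'])

# Apply the tokenizer row by row; str.split stands in for
# nltk.word_tokenize so this runs with pandas alone.
tokens = x.apply(lambda sentence: sentence.split())

print(tokens[0])  # token list for the first abstract only
```

Unlike the join approach, this keeps the row boundaries, which is usually what you want if each abstract should be analyzed on its own.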
Upvotes: 1