Reputation: 13
I also tried .apply(str) and .astype(str) before tokenization, yet I get TypeError: expected string or bytes-like object.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   tag              8 non-null      object
 1   clean_patterns   8 non-null      object
 2   clean_responses  8 non-null      object
dtypes: object(3)
memory usage: 320.0+ bytes
I am trying to word_tokenize the data for an NLP chatbot.
print(word_tokenize(data))
TypeError                                 Traceback (most recent call last)
in
----> 1 print(word_tokenize(data))

D:\anaconda\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
    128     :type preserve_line: bool
    129     """
--> 130     sentences = [text] if preserve_line else sent_tokenize(text, language)
    131     return [
    132         token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)

D:\anaconda\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
    106     """
    107     tokenizer = load("tokenizers/punkt/{0}.pickle".format(language))
--> 108     return tokenizer.tokenize(text)
    109
    110

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1272         Given a text, returns a list of the sentences in that text.
   1273         """
-> 1274         return list(self.sentences_from_text(text, realign_boundaries))
   1275
   1276     def debug_decisions(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1326         follows the period.
   1327         """
-> 1328         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1329
   1330     def _slices_from_text(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1326         follows the period.
   1327         """
-> 1328         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1329
   1330     def _slices_from_text(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1316         if realign_boundaries:
   1317             slices = self._realign_boundaries(text, slices)
-> 1318         for sl in slices:
   1319             yield (sl.start, sl.stop)
   1320

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1357         """
   1358         realign = 0
-> 1359         for sl1, sl2 in _pair_iter(slices):
   1360             sl1 = slice(sl1.start + realign, sl1.stop)
   1361             if not sl2:

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    314     it = iter(it)
    315     try:
--> 316         prev = next(it)
    317     except StopIteration:
    318         return

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1330     def _slices_from_text(self, text):
   1331         last_break = 0
-> 1332         for match in self._lang_vars.period_context_re().finditer(text):
   1333             context = match.group() + match.group("after_tok")
   1334             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
Upvotes: 0
Views: 1693
Reputation: 2056
Welcome to SO ;)
Given the following dataframe data and the function word_tokenize (here a simple stand-in for the real tokenizer), you can do:
import pandas as pd

def word_tokenize(sentence):
    return sentence.split()

data = pd.DataFrame(data={'col1': ['bar bar bar foo',
                                   'foo foo foo bar', 124],
                          'col2': [12, 13, 14]})
Applying the function on col1 in the simplest way possible:
data['col1'].astype(str).apply(word_tokenize)
# output
0    [bar, bar, bar, foo]
1    [foo, foo, foo, bar]
2                   [124]
Name: col1, dtype: object
This first changes the type to str, then applies the function to every single element. The output is a pandas.core.series.Series.
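If you use NLTK's real word_tokenize instead of the stand-in above, the same pattern applies column by column. Here is a minimal sketch, assuming the punkt model has been downloaded; the rows are made up and only the column names come from your data.info():

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # punkt model required by NLTK's tokenizers

# hypothetical rows; only the column names match the question's data.info()
data = pd.DataFrame({
    'tag': ['greeting', 'goodbye'],
    'clean_patterns': ['hi there', 'see you later'],
    'clean_responses': ['hello, how can I help', 'bye, take care'],
})

# word_tokenize expects a single string, not a whole DataFrame,
# so apply it element-wise to each text column
for col in ['clean_patterns', 'clean_responses']:
    data[col + '_tokens'] = data[col].astype(str).apply(word_tokenize)

print(data['clean_patterns_tokens'])
# 0          [hi, there]
# 1    [see, you, later]
# Name: clean_patterns_tokens, dtype: object

The key point is the same as above: the tokenizer works on one string at a time, so you pass it each cell via .apply rather than passing the DataFrame itself.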
Upvotes: 1