Reputation: 13
I also tried .apply(str) and .astype(str) before tokenization, yet I get TypeError: expected string or bytes-like object.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   tag              8 non-null      object
 1   clean_patterns   8 non-null      object
 2   clean_responses  8 non-null      object
dtypes: object(3)
memory usage: 320.0+ bytes
I am trying to word_tokenize the data for an NLP chatbot.
print(word_tokenize(data))
TypeError                                 Traceback (most recent call last)
in
----> 1 print(word_tokenize(data))

D:\anaconda\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
    128     :type preserve_line: bool
    129     """
--> 130     sentences = [text] if preserve_line else sent_tokenize(text, language)
    131     return [
    132         token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)

D:\anaconda\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
    106     """
    107     tokenizer = load("tokenizers/punkt/{0}.pickle".format(language))
--> 108     return tokenizer.tokenize(text)
    109
    110

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
   1272         Given a text, returns a list of the sentences in that text.
   1273         """
-> 1274         return list(self.sentences_from_text(text, realign_boundaries))
   1275
   1276     def debug_decisions(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1326         follows the period.
   1327         """
-> 1328         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1329
   1330     def _slices_from_text(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1326         follows the period.
   1327         """
-> 1328         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1329
   1330     def _slices_from_text(self, text):

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1316         if realign_boundaries:
   1317             slices = self._realign_boundaries(text, slices)
-> 1318         for sl in slices:
   1319             yield (sl.start, sl.stop)
   1320

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
   1357         """
   1358         realign = 0
-> 1359         for sl1, sl2 in _pair_iter(slices):
   1360             sl1 = slice(sl1.start + realign, sl1.stop)
   1361             if not sl2:

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    314     it = iter(it)
    315     try:
--> 316         prev = next(it)
    317     except StopIteration:
    318         return

D:\anaconda\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
   1330     def _slices_from_text(self, text):
   1331         last_break = 0
-> 1332         for match in self._lang_vars.period_context_re().finditer(text):
   1333             context = match.group() + match.group("after_tok")
   1334             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object
Upvotes: 0
Views: 1693
Reputation: 2056
Welcome to SO ;)
Given the following dataframe data and the function word_tokenize (here a simple stand-in for the real tokenizer), you can do:
import pandas as pd

def word_tokenize(sentence):
    return sentence.split()

data = pd.DataFrame(data={'col1': ['bar bar bar foo',
                                   'foo foo foo bar', 124],
                          'col2': [12, 13, 14]})
Applying the function on col1 in the simplest way possible:
data['col1'].astype(str).apply(word_tokenize)
# output
0    [bar, bar, bar, foo]
1    [foo, foo, foo, bar]
2                   [124]
Name: col1, dtype: object
This first changes the type to str, then applies the function to every single element. The output is a pandas.core.series.Series.
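If you use NLTK's real word_tokenize instead of the stand-in above, the same pattern applies column by column. Here is a minimal sketch, assuming the punkt model has been downloaded; the rows are made up and only the column names come from your data.info():

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # punkt model required by NLTK's tokenizers

# hypothetical rows; only the column names match the question's data.info()
data = pd.DataFrame({
    'tag': ['greeting', 'goodbye'],
    'clean_patterns': ['hi there', 'see you later'],
    'clean_responses': ['hello, how can I help', 'bye, take care'],
})

# word_tokenize expects a single string, not a whole DataFrame,
# so apply it element-wise to each text column
for col in ['clean_patterns', 'clean_responses']:
    data[col + '_tokens'] = data[col].astype(str).apply(word_tokenize)

print(data['clean_patterns_tokens'])
# 0          [hi, there]
# 1    [see, you, later]
# Name: clean_patterns_tokens, dtype: object

The key point is the same as above: the tokenizer works on one string at a time, so you pass it each cell via .apply rather than passing the DataFrame itself.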
Upvotes: 1