Reputation: 2067
Where am I going wrong with this? I am trying to iterate over each row of my data frame and encode the text.
data['text'] = data.apply(lambda row:
codecs(row['text'], "r", 'utf-8'), axis=1)
I get this error. Why is the UTF encoding affecting this part of the code? If I do not run the UTF encoding step, I do not get an error:
TypeError Traceback (most recent call last)
<ipython-input-101-0e1d5977a3b3> in <module>
----> 1 data['text'] = codecs(data['text'], "r", 'utf-8')
2
3 data['text'] = data.apply(lambda row:
4 codecs(row['text'], "r", 'utf-8'), axis=1)
TypeError: 'module' object is not callable
When I apply the solutions, both work; however, I then get this error:
data['text_tokens'] = data.apply(lambda row:
nltk.word_tokenize(row['text']), axis=1)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-138-73972d522748> in <module>
1 data['text_tokens'] = data.apply(lambda row:
----> 2 nltk.word_tokenize(row['text']), axis=1)
~/env/lib64/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6485 args=args,
6486 kwds=kwds)
-> 6487 return op.get_result()
6488
6489 def applymap(self, func):
~/env/lib64/python3.6/site-packages/pandas/core/apply.py in get_result(self)
149 return self.apply_raw()
150
--> 151 return self.apply_standard()
152
153 def apply_empty_result(self):
~/env/lib64/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
255
256 # compute the result using the series generator
--> 257 self.apply_series_generator()
258
259 # wrap results
~/env/lib64/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
284 try:
285 for i, v in enumerate(series_gen):
--> 286 results[i] = self.f(v)
287 keys.append(v.name)
288 except Exception as e:
<ipython-input-138-73972d522748> in <lambda>(row)
1 data['text_tokens'] = data.apply(lambda row:
----> 2 nltk.word_tokenize(row['text']), axis=1)
~/env/lib64/python3.6/site-packages/nltk/tokenize/__init__.py in word_tokenize(text, language, preserve_line)
142 :type preserve_line: bool
143 """
--> 144 sentences = [text] if preserve_line else sent_tokenize(text, language)
145 return [
146 token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
~/env/lib64/python3.6/site-packages/nltk/tokenize/__init__.py in sent_tokenize(text, language)
104 """
105 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106 return tokenizer.tokenize(text)
107
108
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in tokenize(self, text, realign_boundaries)
1275 Given a text, returns a list of the sentences in that text.
1276 """
-> 1277 return list(self.sentences_from_text(text, realign_boundaries))
1278
1279 def debug_decisions(self, text):
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in sentences_from_text(self, text, realign_boundaries)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in <listcomp>(.0)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in span_tokenize(self, text, realign_boundaries)
1319 if realign_boundaries:
1320 slices = self._realign_boundaries(text, slices)
-> 1321 for sl in slices:
1322 yield (sl.start, sl.stop)
1323
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in _realign_boundaries(self, text, slices)
1360 """
1361 realign = 0
-> 1362 for sl1, sl2 in _pair_iter(slices):
1363 sl1 = slice(sl1.start + realign, sl1.stop)
1364 if not sl2:
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in _pair_iter(it)
316 it = iter(it)
317 try:
--> 318 prev = next(it)
319 except StopIteration:
320 return
~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
1333 def _slices_from_text(self, text):
1334 last_break = 0
-> 1335 for match in self._lang_vars.period_context_re().finditer(text):
1336 context = match.group() + match.group('after_tok')
1337 if self.text_contains_sentbreak(context):
TypeError: ('cannot use a string pattern on a bytes-like object', 'occurred at index 0')
Upvotes: 3
Views: 3662
Reputation: 7361
As the first error says, codecs is not callable; in fact, it is the name of the module.
You probably want:
data['text'] = data.apply(lambda row:
    codecs.encode(row['text'], 'utf-8'), axis=1)
The error raised by word_tokenize is due to the fact that the function is applied to the previously encoded string: codecs.encode renders the text as a bytes object.
From the codecs doc:
Most standard codecs are text encodings, which encode text to bytes, but there are also codecs provided that encode text to text, and bytes to bytes.
word_tokenize doesn't work on bytes objects, as the error says (see the last line of your traceback). If you remove the encoding step it will work, as in the sketch below.
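A minimal sketch of that (the dataframe contents and the column name 'text' are hypothetical stand-ins for your data; the Punkt tokenizer data must be downloaded once):

import pandas as pd
import nltk

# nltk.download('punkt')  # needed once for word_tokenize

# hypothetical stand-in for your dataframe
data = pd.DataFrame({'text': ["First sentence here.", "Another short sentence."]})

# tokenize the plain str column directly -- no codecs.encode step beforehand
data['text_tokens'] = data.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)

print(data['text_tokens'][0])   # ['First', 'sentence', 'here', '.']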
About your worries regarding the video: the prefix u means Unicode. The prefix b means bytes literal. This is the prefix you will see on the strings if you print your dataframe after using codecs.encode.
In Python 3 (I see from the traceback that your version is 3.6) the default string type is Unicode, so the u prefix is redundant and often not shown, but the strings are already Unicode.
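A quick sketch to illustrate the difference (the example string is made up, purely for demonstration):

import codecs

s = "café"                      # a Python 3 str is already Unicode; the u prefix is implicit
b = codecs.encode(s, 'utf-8')   # a bytes object, printed with the b prefix: b'caf\xc3\xa9'
print(type(s), type(b))         # <class 'str'> <class 'bytes'>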
So I'm quite sure you are safe: you can simply skip codecs.encode.
Upvotes: 3
Reputation: 571
You could even do something simpler:
df['text'] = df['text'].str.encode('utf-8')
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.encode.html
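For completeness, a minimal sketch of that approach (the dataframe contents are made up, and the decode step back to str is only there to show the round trip):

import pandas as pd

df = pd.DataFrame({'text': ["hello", "café"]})
df['text_bytes'] = df['text'].str.encode('utf-8')        # Series of bytes objects (b prefix)
df['text_again'] = df['text_bytes'].str.decode('utf-8')  # back to str, if ever needed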
Upvotes: 3