user8959427

Reputation: 2067

Encoding text columns in a pandas DataFrame

Where am I going wrong with this? I am trying to iterate over each row of my data frame and encode the text.

data['text'] = data.apply(lambda row: 
    codecs(row['text'], "r", 'utf-8'), axis=1)

I get the error below. Why is the UTF encoding affecting this part of the code? If I do not run the UTF encoding, I do not get an error:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-101-0e1d5977a3b3> in <module>
    ----> 1 data['text'] = codecs(data['text'], "r", 'utf-8')
          2 
          3 data['text'] = data.apply(lambda row: 
          4     codecs(row['text'], "r", 'utf-8'), axis=1)

    TypeError: 'module' object is not callable

When I apply the solutions from the answers, both work; however, I then get this error:

    data['text_tokens'] = data.apply(lambda row: 
        nltk.word_tokenize(row['text']), axis=1)

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-138-73972d522748> in <module>
      1 data['text_tokens'] = data.apply(lambda row: 
----> 2     nltk.word_tokenize(row['text']), axis=1)

~/env/lib64/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6485                          args=args,
   6486                          kwds=kwds)
-> 6487         return op.get_result()
   6488 
   6489     def applymap(self, func):

~/env/lib64/python3.6/site-packages/pandas/core/apply.py in get_result(self)
    149             return self.apply_raw()
    150 
--> 151         return self.apply_standard()
    152 
    153     def apply_empty_result(self):

~/env/lib64/python3.6/site-packages/pandas/core/apply.py in apply_standard(self)
    255 
    256         # compute the result using the series generator
--> 257         self.apply_series_generator()
    258 
    259         # wrap results

~/env/lib64/python3.6/site-packages/pandas/core/apply.py in apply_series_generator(self)
    284             try:
    285                 for i, v in enumerate(series_gen):
--> 286                     results[i] = self.f(v)
    287                     keys.append(v.name)
    288             except Exception as e:

<ipython-input-138-73972d522748> in <lambda>(row)
      1 data['text_tokens'] = data.apply(lambda row: 
----> 2     nltk.word_tokenize(row['text']), axis=1)

~/env/lib64/python3.6/site-packages/nltk/tokenize/__init__.py in word_tokenize(text, language, preserve_line)
    142     :type preserve_line: bool
    143     """
--> 144     sentences = [text] if preserve_line else sent_tokenize(text, language)
    145     return [
    146         token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)

~/env/lib64/python3.6/site-packages/nltk/tokenize/__init__.py in sent_tokenize(text, language)
    104     """
    105     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106     return tokenizer.tokenize(text)
    107 
    108 

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in tokenize(self, text, realign_boundaries)
   1275         Given a text, returns a list of the sentences in that text.
   1276         """
-> 1277         return list(self.sentences_from_text(text, realign_boundaries))
   1278 
   1279     def debug_decisions(self, text):

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in sentences_from_text(self, text, realign_boundaries)
   1329         follows the period.
   1330         """
-> 1331         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1332 
   1333     def _slices_from_text(self, text):

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in <listcomp>(.0)
   1329         follows the period.
   1330         """
-> 1331         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1332 
   1333     def _slices_from_text(self, text):

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in span_tokenize(self, text, realign_boundaries)
   1319         if realign_boundaries:
   1320             slices = self._realign_boundaries(text, slices)
-> 1321         for sl in slices:
   1322             yield (sl.start, sl.stop)
   1323 

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in _realign_boundaries(self, text, slices)
   1360         """
   1361         realign = 0
-> 1362         for sl1, sl2 in _pair_iter(slices):
   1363             sl1 = slice(sl1.start + realign, sl1.stop)
   1364             if not sl2:

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in _pair_iter(it)
    316     it = iter(it)
    317     try:
--> 318         prev = next(it)
    319     except StopIteration:
    320         return

~/env/lib64/python3.6/site-packages/nltk/tokenize/punkt.py in _slices_from_text(self, text)
   1333     def _slices_from_text(self, text):
   1334         last_break = 0
-> 1335         for match in self._lang_vars.period_context_re().finditer(text):
   1336             context = match.group() + match.group('after_tok')
   1337             if self.text_contains_sentbreak(context):

TypeError: ('cannot use a string pattern on a bytes-like object', 'occurred at index 0')

Upvotes: 3

Views: 3662

Answers (2)

Valentino

Reputation: 7361

Encoding

As the first error says, codecs is not callable: it is the name of the module, not a function.

You probably want:

data['text'] = data.apply(lambda row: 
    codecs.encode(row['text'], 'utf-8'), axis=1)
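For illustration, here is what that does on a single value (a minimal sketch; codecs.encode(s, 'utf-8') is equivalent to s.encode('utf-8')):

    import codecs

    s = "café"
    b = codecs.encode(s, 'utf-8')  # same as s.encode('utf-8')
    print(type(b))                 # <class 'bytes'>
    print(b)                       # b'caf\xc3\xa9'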

Tokenization

The error raised by word_tokenize is due to the fact that the function is applied to the previously encoded text: codecs.encode converts the text into a bytes object.
From the codecs docs:

Most standard codecs are text encodings, which encode text to bytes, but there are also codecs provided that encode text to text, and bytes to bytes.

word_tokenize doesn't work on bytes objects, as the error says (see the last line of your traceback).
If you remove the encoding step, it will work.
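For instance, a minimal sketch with made-up data (this assumes the punkt tokenizer data is available, e.g. via nltk.download('punkt')):

    import nltk
    import pandas as pd

    # hypothetical sample data
    data = pd.DataFrame({'text': ["First sentence. Second one.", "Another row."]})

    # tokenize the plain unicode strings directly, without codecs.encode
    data['text_tokens'] = data['text'].apply(nltk.word_tokenize)
    print(data['text_tokens'][0])
    # ['First', 'sentence', '.', 'Second', 'one', '.']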


About your worries on the video: the prefix u means Unicode.
The prefix b means bytes literal. This is the prefix you will see on the strings if you print your DataFrame after using codecs.encode.
In Python 3 (I can see from the traceback that your version is 3.6), the default string type is Unicode, so the u prefix is redundant and often not shown; the strings are already Unicode.
So I'm quite sure you are safe: you can simply skip codecs.encode.
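A quick check in any Python 3 interpreter makes the difference visible:

    s = "hello"
    print(type(s))        # <class 'str'>: str is already unicode in Python 3
    print(s == u"hello")  # True: the u prefix is redundant
    b = s.encode('utf-8')
    print(type(b))        # <class 'bytes'>: this is what codecs.encode produces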

Upvotes: 3

caiolopes

Reputation: 571

You could even do something simpler:

df['text'] = df['text'].str.encode('utf-8')

Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.encode.html
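For example, with some made-up data; .str.decode is the inverse if you ever need the unicode strings back:

    import pandas as pd

    df = pd.DataFrame({'text': ["café", "naïve"]})
    df['text'] = df['text'].str.encode('utf-8')  # column now holds bytes
    print(df['text'].iloc[0])                    # b'caf\xc3\xa9'

    df['text'] = df['text'].str.decode('utf-8')  # round trip back to str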

Upvotes: 3
