Reputation: 9335
I have a dataframe that has a column, _text, containing the text of an article. I'm trying to get the length of the article for each row in my dataframe. Here's my attempt:
from bs4 import BeautifulSoup
result_df['_text'] = [BeautifulSoup(text, "lxml").get_text() for text in result_df['_text']]
text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]
Unfortunately, I get this error:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-8-f6c8ab83a46f> in <module>()
----> 1 text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 231: ordinal not in range(128)
Seems like I should be specifying "utf-8" somewhere, I'm just not sure where...
Thanks!
Upvotes: 1
Views: 15997
Reputation: 148870
I assume that you are using Python 2 and that your input text contains non-ASCII characters. The problem arises at str(x), which, when x is a unicode string, by default amounts to x.encode('ascii'). That call fails on any character outside the ASCII range, such as u'\u2019'.
There are two ways to solve this problem:
correctly encode the unicode string in utf-8:
text_word_length = [len(x.encode('utf-8').split(" ")) for x in result_df['_text']]
split the string as unicode:
text_word_length = [len(x.split(u" ")) for x in result_df['_text']]
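Both options can be checked without a dataframe. The following is only a minimal sketch: the sample string is made up for illustration and stands in for one value of result_df['_text'], and the byte-string separator b" " is used so that the same lines run on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample standing in for one value of result_df['_text'].
text = u"It\u2019s a short article"

# What Python 2's str(x) does implicitly for a unicode x: encode as ASCII.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: encode explicitly as UTF-8, then split the resulting bytes.
count_utf8 = len(text.encode("utf-8").split(b" "))

# Option 2: skip the conversion and split the unicode string directly.
count_unicode = len(text.split(u" "))

print(ascii_failed, count_utf8, count_unicode)
```

Both counts agree here because splitting on a single space is unaffected by the encoding; the second option avoids the round trip through bytes altogether.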
Upvotes: 5
Reputation: 431
According to the official Python documentation (Python Official Site):
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors):
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or:
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
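As a rough illustration, assuming the file is saved as UTF-8, a declaration like this lets a non-ASCII literal appear directly in the source. The variable names are hypothetical. Note that the declaration only controls how the source file itself is decoded; it does not change what str() does to a unicode value at runtime.

```python
# -*- coding: utf-8 -*-
# The declaration above tells the interpreter how to decode this file,
# so the literal below can contain a non-ASCII character directly.
apostrophe = u"’"

# The literal is the same character as the escaped form from the traceback.
same_character = (apostrophe == u"\u2019")
print(same_character)
```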
Upvotes: 1