Unicode / Ascii errors importing text into Pandas Dataframe

Question

I'm using Python (2.7) and Requests to grab data from the Facebook API and then using Pandas to report on the output via IPython. Somewhere along the journey I'm encountering Unicode / Ascii errors and I'm stumped about what to change.

Hoping the solution will be obvious to someone well versed in the area.

First, I'm using Requests to grab the API data using a helper module I've created.

_current_request = "https://graph.facebook.com/officialstackoverflow/feed?access_token=[Redacted access token]"
response = requests.get(_current_request)

Requests.json() fails straight away due to encoding, so I've been using the following:

encoded = response.content.encode("utf-8")  # Excuse verbosity, just trying
json_response = json.loads(encoded)         # to be clear on my thought
response_list = list()                      # process, and hoping it will help
response_list += json_response["data"]      # debugging.

(The "data" key is the actual contents from the FB API. It's a list of individual post objects)
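For comparison, here is a minimal sketch of the same step done as a decode rather than an encode, assuming the response body is UTF-8 encoded JSON. The payload is made up and stands in for `response.content`; no request is sent.

```python
import json

# Hypothetical stand-in for response.content: UTF-8 bytes of a JSON document.
raw_bytes = b'{"data": [{"id": "1", "message": "It\xe2\x80\x99s live"}]}'

# Going from bytes to text is a decode step. Calling .encode() on a Python 2
# byte string first decodes it implicitly as ASCII, which can fail.
text = raw_bytes.decode("utf-8")

json_response = json.loads(text)
response_list = list(json_response["data"])  # list of post dicts
```

After this, every string inside `response_list` is a text (unicode) object rather than a byte string.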

I'm then passing the response_list object back into the IPython notebook to manipulate.

In [1]: pd.DataFrame(response_list)

Traceback:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
 in ()

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
    236                 self.write_format_data(format_dict, md_dict)
    237                 self.log_output(format_dict)
--> 238             self.finish_displayhook()
    239 
    240     def cull_cache(self):

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/displayhook.pyc in finish_displayhook(self)
     70         sys.stderr.flush()
     71         if self.msg['content']['data']:
---> 72             self.session.send(self.pub_socket, self.msg, ident=self.topic)
     73         self.msg = None
     74 

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/session.pyc in send(self, stream, msg_or_type, content, parent, ident, buffers, track, header, metadata)
    647         if self.adapt_version:
    648             msg = adapt(msg, self.adapt_version)
--> 649         to_send = self.serialize(msg, ident)
    650         to_send.extend(buffers)
    651         longest = max([ len(s) for s in to_send ])

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/session.pyc in serialize(self, msg, ident)
    551             content = self.none
    552         elif isinstance(content, dict):
--> 553             content = self.pack(content)
    554         elif isinstance(content, bytes):
    555             # content is already packed, as in a relayed message

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/session.pyc in <lambda>(obj)
     83 # disallow nan, because it's not actually valid JSON
     84 json_packer = lambda obj: jsonapi.dumps(obj, default=date_default,
---> 85     ensure_ascii=False, allow_nan=False,
     86 )
     87 json_unpacker = lambda s: jsonapi.loads(s)

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/zmq/utils/jsonapi.pyc in dumps(o, **kwargs)
     38         kwargs['separators'] = (',', ':')
     39 
---> 40     s = jsonmod.dumps(o, **kwargs)
     41 
     42     if isinstance(s, unicode):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.pyc in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, sort_keys, **kw)
    248         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    249         separators=separators, encoding=encoding, default=default,
--> 250         sort_keys=sort_keys, **kw).encode(obj)
    251 
    252 

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.pyc in encode(self, o)
    208         if not isinstance(chunks, (list, tuple)):
    209             chunks = list(chunks)
--> 210         return ''.join(chunks)
    211 
    212     def iterencode(self, o, _one_shot=False):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 17915: ordinal not in range(128)

Clearly it's an issue along the path with the encoding / decoding of Unicode objects into the DataFrame, but what's confusing me is that Pandas has a native Unicode object so I'm not sure why the Ascii conversion is taking place anyway.
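One plausible reading of the last traceback frame, offered as an assumption rather than a confirmed diagnosis: on Python 2, joining byte strings with unicode objects makes the interpreter decode the byte strings as ASCII, and byte 0xe2 (the lead byte of many UTF-8 punctuation characters) is outside that range. The sketch below spells out that implicit decode so that it also runs on Python 3.

```python
# U+2019 (right single quotation mark) is used here only as an example
# of a character whose UTF-8 encoding starts with byte 0xe2.
utf8_bytes = u"\u2019".encode("utf-8")  # three bytes: e2 80 99

try:
    # Python 2 performs this decode implicitly when str and unicode are mixed.
    utf8_bytes.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were written in succeeds.
decoded = utf8_bytes.decode("utf-8")
```

The message captured in `error_message` names the same codec and byte as the traceback, which suggests the failure happens while the notebook serializes the output, not inside Pandas itself.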

Thanks in advance for any help, and please ask if I need to add any further info.

Additional info

I've looked into the data types for each dictionary key and confirmed that it's a mix of sub-dicts and Unicode objects:

Key: picture, 
Key: story, 
Key: likes, 
Key: from, 
Key: comments, 
Key: message_tags, 
Key: privacy, 
Key: actions, 
Key: updated_time, 
Key: to, 
Key: link, 
Key: object_id, 
Key: story_tags, 
Key: created_time, 
Key: message, 
Key: type, 
Key: id, 
Key: status_type, 
Key: icon,

I've tried re-encoding each of these into str but that's not helping either - and it also doesn't seem like it should be necessary, as Pandas can handle Unicode anyway.
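Converting in the other direction (byte strings to text) is the more usual normalisation. The helper below is a hypothetical sketch, not something from the original code or the accepted answer: `to_text` is an invented name, the sample record is made up, and UTF-8 is assumed as the encoding.

```python
def to_text(value, encoding="utf-8"):
    """Recursively decode byte strings to text; leave other values alone."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, dict):
        return {to_text(k, encoding): to_text(v, encoding)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples come back as lists, which is fine for JSON-like data.
        return [to_text(item, encoding) for item in value]
    return value


# Made-up post record mixing a byte string, a text string and a sub-dict.
post = {"message": b"caf\xc3\xa9", "from": {"name": u"Stack Overflow"}, "likes": 3}
clean = to_text(post)
```

Applied to each item of `response_list`, this would leave values that are already text untouched, so it is harmless when the JSON parser has already produced unicode objects.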

Thomas K · Accepted Answer

Reposting as an answer:

This is a known issue in at least IPython 3.0, and probably older versions. A fix has been merged, and will be in IPython 3.1.

The issue only affects Python 2.
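A small sketch of how that description could be turned into a check. `probably_affected` is a hypothetical helper, and treating every release before 3.1 as affected is an assumption, since only 3.0 is stated with certainty above.

```python
def probably_affected(ipython_version, python_major):
    """Return True when the description above suggests the bug applies.

    Assumption: all IPython releases before 3.1 on Python 2 are treated
    as affected, although only 3.0 is confirmed.
    """
    return python_major == 2 and tuple(ipython_version[:2]) < (3, 1)
```

In a live session the arguments could come from `IPython.version_info` and `sys.version_info[0]`.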
