Phil Sheard

Reputation: 2162

Unicode / Ascii errors importing text into Pandas Dataframe

I'm using Python (2.7) and Requests to grab data from the Facebook API and then using Pandas to report on the output via IPython. Somewhere along the journey I'm encountering Unicode / Ascii errors and I'm stumped about what to change.

Hoping the solution will be obvious to someone well versed in the area.


First, I'm using Requests to grab the API data using a helper module I've created.

import requests

_current_request = "https://graph.facebook.com/officialstackoverflow/feed?access_token=[Redacted access token]"
response = requests.get(_current_request)

Calling response.json() fails straight away due to encoding, so I've been using the following:

encoded = response.content.encode("utf-8")  # Excuse verbosity, just trying
json_response = json.loads(encoded)         # to be clear on my thought
response_list = list()                      # process, and hoping it will help
response_list += json_response["data"]      # debugging.

(The "data" key is the actual contents from the FB API. It's a list of individual post objects)
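For reference, here is a minimal sketch of the decode step, assuming the response body is UTF-8 encoded bytes. In Python 2, calling .encode() on a byte string first triggers an implicit ASCII decode, so decoding the bytes to text is usually the safer direction. The payload below is made up to stand in for response.content; it is not real API output.

```python
import json

# Hypothetical payload standing in for response.content (raw bytes from the API).
# \xc3\xa9 is "e acute" and \xe2\x80\x93 is an en dash, both encoded as UTF-8.
raw_bytes = b'{"data": [{"id": "1", "message": "caf\xc3\xa9 \xe2\x80\x93 open"}]}'

# Decode bytes to text first, then parse the text as JSON.
text = raw_bytes.decode("utf-8")
json_response = json.loads(text)

# The "data" key holds the list of post objects.
response_list = list(json_response["data"])
```

With this approach every string in response_list is a text (unicode) object rather than a byte string.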

I'm then passing the response_list object back into the IPython notebook to manipulate.

[1] pd.DataFrame(response_list)

Traceback:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-290-0613adf928ec> in <module>()

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
    236                 self.write_format_data(format_dict, md_dict)
    237                 self.log_output(format_dict)
--> 238             self.finish_displayhook()
    239 
    240     def cull_cache(self):

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/displayhook.pyc in finish_displayhook(self)
     70         sys.stderr.flush()
     71         if self.msg['content']['data']:
---> 72             self.session.send(self.pub_socket, self.msg, ident=self.topic)
     73         self.msg = None
     74 

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/session.pyc in send(self, stream, msg_or_type, content, parent, ident, buffers, track, header, metadata)
    647         if self.adapt_version:
    648             msg = adapt(msg, self.adapt_version)
--> 649         to_send = self.serialize(msg, ident)
    650         to_send.extend(buffers)
    651         longest = max([ len(s) for s in to_send ])

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/session.pyc in serialize(self, msg, ident)
    551             content = self.none
    552         elif isinstance(content, dict):
--> 553             content = self.pack(content)
    554         elif isinstance(content, bytes):
    555             # content is already packed, as in a relayed message

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/IPython/kernel/zmq/session.pyc in <lambda>(obj)
     83 # disallow nan, because it's not actually valid JSON
     84 json_packer = lambda obj: jsonapi.dumps(obj, default=date_default,
---> 85     ensure_ascii=False, allow_nan=False,
     86 )
     87 json_unpacker = lambda s: jsonapi.loads(s)

/Users/Shared/Sites/Virtualenv/api_manager/lib/python2.7/site-packages/zmq/utils/jsonapi.pyc in dumps(o, **kwargs)
     38         kwargs['separators'] = (',', ':')
     39 
---> 40     s = jsonmod.dumps(o, **kwargs)
     41 
     42     if isinstance(s, unicode):

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.pyc in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, sort_keys, **kw)
    248         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    249         separators=separators, encoding=encoding, default=default,
--> 250         sort_keys=sort_keys, **kw).encode(obj)
    251 
    252 

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.pyc in encode(self, o)
    208         if not isinstance(chunks, (list, tuple)):
    209             chunks = list(chunks)
--> 210         return ''.join(chunks)
    211 
    212     def iterencode(self, o, _one_shot=False):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 17915: ordinal not in range(128)

Clearly it's an issue along the path with the encoding / decoding of Unicode objects into the DataFrame, but what's confusing me is that Pandas has a native Unicode object so I'm not sure why the Ascii conversion is taking place anyway.
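The last frame of the traceback is ''.join(chunks). In Python 2, joining byte strings with unicode objects forces an implicit ASCII decode of the byte strings, which is one way to end up with exactly this message. The sketch below reproduces only the decode failure itself, using a made-up byte sequence; it does not reproduce the IPython code path.

```python
# UTF-8 bytes of an en dash; the first byte is 0xe2, as in the traceback.
chunk = b"\xe2\x80\x93"

error_message = None
try:
    # What an implicit ASCII decode effectively attempts.
    chunk.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
decoded = chunk.decode("utf-8")

print(error_message)
```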

Thanks in advance for any help, and please ask if I need to add any further info.


Additional info

I've looked into the data types for each dictionary key and confirmed that it's a mix of sub-dicts and Unicode objects:

Key: picture, <type 'unicode'>
Key: story, <type 'unicode'>
Key: likes, <type 'dict'>
Key: from, <type 'dict'>
Key: comments, <type 'dict'>
Key: message_tags, <type 'dict'>
Key: privacy, <type 'dict'>
Key: actions, <type 'list'>
Key: updated_time, <type 'unicode'>
Key: to, <type 'dict'>
Key: link, <type 'unicode'>
Key: object_id, <type 'unicode'>
Key: story_tags, <type 'dict'>
Key: created_time, <type 'unicode'>
Key: message, <type 'unicode'>
Key: type, <type 'unicode'>
Key: id, <type 'unicode'>
Key: status_type, <type 'unicode'>
Key: icon, <type 'unicode'>
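A listing like the one above can be produced with a short loop over one post. This is a sketch using a made-up post dict, not the real API data. On Python 3 the text type is reported as str rather than unicode.

```python
# Hypothetical post object with a mix of text, dict and list values.
post = {
    "message": u"caf\u00e9",
    "likes": {"count": 3},
    "actions": [{"name": "Like"}],
}

# Map each key to the name of its value's type.
type_by_key = {key: type(value).__name__ for key, value in post.items()}

for key in sorted(type_by_key):
    print("Key: %s, %s" % (key, type_by_key[key]))
```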

I've tried re-encoding each of these into str but that's not helping either - and it also doesn't seem like it should be necessary, as Pandas can handle Unicode anyway.

Upvotes: 0

Views: 1496

Answers (2)

Thomas K

Reputation: 40390

Reposting as an answer:

This is a known issue in at least IPython 3.0, and probably older versions. A fix has been merged, and will be in IPython 3.1.

The issue only affects Python 2.
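To check whether an environment matches the description above, a small sketch follows. The helper name may_be_affected is hypothetical, and the cut-off simply encodes the statement that the fix ships in IPython 3.1. Whether versions older than 3.0 are affected is only "probable", so treat the result as a hint.

```python
import sys


def may_be_affected(python_major, ipython_version):
    """Hypothetical helper: True for Python 2 with IPython older than 3.1."""
    return python_major == 2 and tuple(ipython_version[:2]) < (3, 1)


try:
    import IPython
    print(may_be_affected(sys.version_info[0], IPython.version_info))
except ImportError:
    print("IPython is not installed in this environment")
```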

Upvotes: 2

szeitlin

Reputation: 3351

I've had similar problems with unicode objects and pandas. Several things to consider:

  1. In my case, it helped to look at the raw data before trying to make a dataframe out of it, and to use an editor other than IPython notebook (e.g., try vim -b Rawfile.txt to look for byte order marks, magic numbers, etc.). The .ipynb display can make things prettier than you actually want, which means it's doing some things under the hood for display purposes only. Some objects may be different than they appear.

  2. What fixed it for me was passing my object into codecs first, and saving it to a file before converting it to a dataframe. Maybe try that with a portion of your data. That also makes it easy to try different encodings.


import codecs
import pandas

# codecs.open expects a file path, so myObject here is the name of the saved file
opened = codecs.open(myObject, 'rU', 'UTF16')

df = pandas.DataFrame(opened, index = 'ColName')

  3. As you may already know, sometimes pandas will complain if there are missing values and it can't figure out how to coerce all your objects into a symmetric shape, e.g. if there is mismatched nesting of hierarchical indexing because you're not allowing NaNs.

    Make sure that the lengths of the unicode objects match up to the lengths of the dicts.

    Try passing column names to go with the unicode objects (which I'm guessing may not have keys, the way the dicts should), and make sure it knows which column(s) to use for indexing.
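The file round trip from point 2 can be sketched roughly as below, assuming UTF-8 rather than UTF-16 and using made-up sample messages. The file name and variable names are placeholders. The pandas step is left as a comment so the sketch runs with the standard library alone.

```python
import codecs
import os
import tempfile

# Made-up sample text containing non-ASCII characters.
messages = [u"caf\u00e9 \u2013 open", u"plain ascii"]

# Write the text to a file with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), "raw_messages.txt")
with codecs.open(path, "w", encoding="utf-8") as handle:
    for message in messages:
        handle.write(message + u"\n")

# Read it back with the same encoding; each line is decoded to text.
with codecs.open(path, "r", encoding="utf-8") as handle:
    round_tripped = [line.rstrip(u"\n") for line in handle]

# round_tripped could then be handed to pandas, for example:
# df = pandas.DataFrame({"message": round_tripped})
```

Changing the encoding argument in both codecs.open calls makes it easy to try a different encoding on the same data.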

Upvotes: 1
