gdhgdh

Reputation: 3

Unicode in Python - parsing JSON

I've written this little script to take JSON files and import their contents into a Consul key-value store. I was quite pleased that the recursion works exactly as I expect, but less pleased when the source .json files contained non-ASCII characters:

#!/usr/bin/python

import sys
import json

filename = str(sys.argv[1])
fh = open(filename)

def printDict (d, path):
  for key in d:
    if isinstance(d[key], dict):
      printDict(d[key], path + str(key) + "/")
    else:
      print 'curl -X PUT http://localhost:8500/v1/kv/' + filename + path + key + ' -d "' + str(d[key]) + '"'
  return

j = json.load(fh)
printDict(j, "/")

A sample failing JSON file on disk:

{
    "FacetConfig" : {
        "facet:price-lf-p" : {
             "prefixParts" : "£"
        }
    }
}

When I run the code as-is, I'm getting a nasty exception because that nice simple str() can't convert the UK currency pound sign to 7-bit ASCII:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

How can I solve this without making too much of a dog's dinner of code that started out small and elegant? :)

Upvotes: 0

Views: 2714

Answers (3)

7stud

Reputation: 48599

How can I solve this without making too much of a dog's dinner of code that started out small and elegant?

Unfortunately, there are several additional steps necessary to prevent decoding/encoding errors. Python 2.x has lots of places where it does implicit encoding/decoding, i.e., behind your back and without your permission. When Python does an implicit encoding/decoding it uses the ascii codec, which will result in an encoding/decoding error if a UTF-8 (or any other non-ASCII) character is present. As a result, you have to find all the places where Python does implicit encodings/decodings and replace them with explicit encodings/decodings, if you want your program to handle non-ASCII characters in those places.
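To make the implicit step concrete, here is a minimal sketch that performs the same ASCII encode explicitly, and then the UTF-8 encode that avoids the error. It is written to run unchanged on Python 2 and Python 3, and the variable names are only illustrative:

```python
# Sketch: what Python 2's str(u'\xa3') does implicitly, made explicit.
pound = u'\xa3'  # the pound sign as a unicode (text) string

try:
    # Python 2's str() on a unicode object encodes with the ASCII codec.
    pound.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Choosing the codec yourself succeeds: U+00A3 is two bytes in UTF-8.
utf8_bytes = pound.encode('utf-8')

print(ascii_failed)
print(repr(utf8_bytes))
```

The ASCII attempt raises the same UnicodeEncodeError as in the question, while the explicit UTF-8 encode produces a byte string that can be written out safely.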

At the very least, any input from an external source should be decoded into a unicode string before proceeding, which means you have to know the input's encoding. But then if you combine unicode strings with regular strings, you can get encoding/decoding errors, for instance:

#-*- coding: utf-8 -*-   #Allows utf-8 characters in your source code
unicode_str = '€'.decode('utf-8')
my_str = '{0}{1}'.format('This is the Euro sign: ', unicode_str) 

--output:--
Traceback (most recent call last):
  File "1.py", line 3, in <module>
    my_str = '{0}{1}'.format('This is the Euro sign: ', unicode_str)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

Therefore all your strings probably should be decoded into unicode strings. Then when you want to output the strings, you need to encode the unicode strings.

import sys
import json
import codecs
import urllib

def printDict(d, path, filename):
    for key, val in d.items():  #key is a unicode string, val is a unicode string or dict
        if isinstance(val, dict): 
            printDict(
                val,
                u'{0}{1}/'.format(path, key),  #format() specifiers require 0,1 for python 2.6
                filename
            )
        else:
            key_str = key.encode('utf-8')
            val_str = val.encode('utf-8')

            # Quote only the URL path; the -d payload is not part of the URL.
            url_path = '{0}{1}{2}'.format(
                filename, 
                path, 
                key_str
            )
            print url_path
            url_escaped = urllib.quote(url_path)
            print url_escaped

            curl_cmd = 'curl -X PUT'
            base_url = 'http://localhost:8500/v1/kv/'
            print '{0} {1}{2} -d "{3}"'.format(
                curl_cmd, base_url, url_escaped, val_str
            )


filename = sys.argv[1].decode('utf-8')
file_encoding = 'utf-8'
fh = codecs.open(filename, encoding=file_encoding)
my_json = json.load(fh)
fh.close()

print my_json

path = "/"
printDict(my_json, path.decode('utf-8'), filename)  #Can the path have  non-ascii characters in it?

--output:--
{u'FacetConfig': {u'facet:price-lf-p': {u'prefixParts': u'\xa3'}}}
data.txt/FacetConfig/facet:price-lf-p/prefixParts
data.txt/FacetConfig/facet%3Aprice-lf-p/prefixParts
curl -X PUT http://localhost:8500/v1/kv/data.txt/FacetConfig/facet%3Aprice-lf-p/prefixParts -d "£"
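The overall pattern (decode once on input, work with unicode strings in the middle, encode once on output) can be sketched in isolation. This is only an illustration under stated assumptions: `raw` stands in for the bytes of a UTF-8 JSON file, and `flatten` is a hypothetical helper, not part of the code above. It should run on both Python 2 and Python 3:

```python
import json

# Hypothetical raw input: the bytes of a UTF-8 encoded JSON file.
raw = b'{"FacetConfig": {"facet:price-lf-p": {"prefixParts": "\xc2\xa3"}}}'

def flatten(d, path=u'/'):
    """Yield (path, value) pairs; every path stays a unicode string."""
    for key, val in sorted(d.items()):
        if isinstance(val, dict):
            for pair in flatten(val, path + key + u'/'):
                yield pair
        else:
            yield path + key, val

# 1. Decode once, at the boundary where bytes enter the program.
data = json.loads(raw.decode('utf-8'))

# 2. Work only with unicode strings in the middle.
pairs = list(flatten(data))

# 3. Encode once, at the boundary where bytes leave the program.
lines = [(p + u' = ' + v).encode('utf-8') for p, v in pairs]
```

Because nothing in the middle mixes byte strings with unicode strings, Python never has a reason to fall back to the ASCII codec.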

Upvotes: 0

Martijn Pieters

Reputation: 1121914

Rather than use str(), encode the unicode value explicitly. Since you are using your key as a URL element, you'll have to encode the key to UTF-8, then URL-quote that; the value just needs encoding to UTF-8.

import urllib

print ('curl -X PUT http://localhost:8500/v1/kv/' + filename + path +
       urllib.quote(key.encode('utf8')) + ' -d "' + 
       unicode(d[key]).encode('utf8') + '"')

You could use string formatting here to make that a little more readable:

print 'curl -X PUT http://localhost:8500/v1/kv/{}{}{} -d "{}"'.format(
    filename, path, urllib.quote(key.encode('utf8')), 
    unicode(d[key]).encode('utf8'))

The unicode() call is redundant if d[key] is always a string value, but if you also have numbers, booleans or None values this will make sure the code continues to work.

The server may expect a Content-Type header; if you do send one, perhaps consider adding a charset=utf8 parameter to the header. It looks like Consul treats the data as opaque, however.
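Putting those pieces together, here is a hedged sketch of a helper that builds the whole command as a unicode string and leaves the final encoding to the caller. The names `curl_command`, `text_type` and `BASE_URL` are hypothetical, and the try/except imports are there only so the same code runs on Python 2 and Python 3. Like the snippets above, it quotes only the final key; nested path segments containing special characters would need the same treatment:

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

try:
    text_type = unicode               # Python 2
except NameError:
    text_type = str                   # Python 3

BASE_URL = u'http://localhost:8500/v1/kv/'

def curl_command(filename, path, key, value):
    """Return the curl command as a unicode string (hypothetical helper)."""
    # Quote the UTF-8 bytes of the key so it is safe inside a URL.
    quoted_key = quote(key.encode('utf-8'))
    if isinstance(quoted_key, bytes):
        # Python 2 returns a byte string; it is pure ASCII after quoting.
        quoted_key = quoted_key.decode('ascii')
    # text_type() also copes with numbers, booleans and None.
    return u'curl -X PUT {0}{1}{2}{3} -d "{4}"'.format(
        BASE_URL, filename, path, quoted_key, text_type(value))

cmd = curl_command(u'data.json', u'/FacetConfig/facet:price-lf-p/',
                   u'prefixParts', u'\xa3')
encoded = cmd.encode('utf-8')  # encode once, just before output
```

Keeping the command as unicode until the last moment means there is exactly one place where the output encoding is chosen.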

Upvotes: 1

jme

Reputation: 20695

Simply remove the str from str(d[key]). That is,

print ('curl -X PUT http://localhost:8500/v1/kv/' + filename + 
       path + key + ' -d "' + str(d[key]) + '"')

becomes:

print ('curl -X PUT http://localhost:8500/v1/kv/' + filename + 
       path + key + ' -d "' + d[key] + '"')

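The problem here is that calling str() on a unicode object in Python 2 implicitly encodes it with the ASCII codec. type(d[key]) is unicode, so str() fails as soon as it contains a non-ASCII character... but that's fine, as long as d[key] is itself a string we can concatenate and print it anyway.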
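One caveat: printing a unicode string relies on Python knowing the output encoding. In Python 2, when the output is piped or redirected, sys.stdout.encoding is typically None and print falls back to ASCII, so the same error can reappear. A minimal sketch of one workaround is to wrap a byte stream in a UTF-8 writer; here io.BytesIO stands in for the real output stream purely so the result can be inspected, and the URL is just an example value:

```python
import codecs
import io

# Stand-in for a byte-oriented output stream (an assumption for illustration;
# in a Python 2 script this would typically be sys.stdout).
byte_stream = io.BytesIO()

# The writer encodes every unicode string to UTF-8 before writing.
out = codecs.getwriter('utf-8')(byte_stream)
out.write(u'curl -X PUT http://localhost:8500/v1/kv/f/prefixParts -d "\xa3"\n')

written = byte_stream.getvalue()
```

With the writer in place, the encoding no longer depends on whether the script is attached to a terminal.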

Upvotes: 1
