divspec

Reputation: 63

How to encode a unicode string (ones from JSON) to 'utf-8' in python?

I am creating a REST API using Flask in Python. One of the URLs (/uploads) accepts a POST request with a JSON body such as '{"src":"void", "settings":"my settings"}'. I can extract each value individually and encode it to a byte string, which can then be hashed using hashlib. However, my goal is to take the whole object and encode it, something like myfile.encode('utf-8'). Printing myfile displays {u'src':u'void', u'settings':u'my settings'}. Is there any way I can take the above unicode data and encode it to UTF-8 as a sequence of bytes for hashlib.sha1(myfile.encode('utf-8'))? Let me know if you need more clarification. Thanks in advance.

fileSRC = request.json['src']
fileSettings = request.json['settings']

myfile = request.json
print myfile

# hash the filename using sha1 from the hashlib library
guid_object = hashlib.sha1(fileSRC.encode('utf-8'))  # this works, but I want myfile to be encoded, not fileSRC
guid = guid_object.hexdigest()  # this works
print guid

Upvotes: 3

Views: 5563

Answers (1)

Quentin Pradet

Reputation: 4771

As you said in comments, you solved your issue using:

jsonContent = json.dumps(request.json)
guid_object = hashlib.sha1(jsonContent.encode('utf-8'))

But it's important to understand why this works. Flask gives you unicode objects for non-ASCII strings and str objects for ASCII ones. Dumping the result with json.dumps() gives you consistent results because it abstracts away the internal Python representation, just as if you only had unicode objects.

Python 2

In Python 2 (the Python version you're using), you don't need .encode('utf-8') because the ensure_ascii parameter of json.dumps() defaults to True. When you pass non-ASCII data to json.dumps(), it uses JSON escape sequences so that the output is pure ASCII: there is no need to encode to UTF-8. Also, since the Zen of Python says that "Explicit is better than implicit", you could specify ensure_ascii even though it is already True:

jsonContent = json.dumps(request.json, ensure_ascii=True)
guid_object = hashlib.sha1(jsonContent)
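
To make the escaping behaviour concrete, here is a minimal sketch. The payload is a hypothetical stand-in for request.json, and the non-ASCII character is written as an escape so that the snippet runs unchanged on Python 2.7 and Python 3:

```python
import json

# Hypothetical payload standing in for request.json (an assumption of this
# sketch); u"caf\u00e9" contains one non-ASCII character.
payload = {u"src": u"void", u"settings": u"caf\u00e9"}

dumped = json.dumps(payload, ensure_ascii=True)

# Every character of the dumped text is plain ASCII...
is_ascii = all(ord(ch) < 128 for ch in dumped)

# ...because the non-ASCII character was replaced by a JSON \uXXXX escape.
has_escape = "\\u00e9" in dumped
```

Both is_ascii and has_escape end up True, which is why the dumped text can be hashed directly in Python 2.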

Python 3

In Python 3, however, this no longer works. Indeed, json.dumps() returns str (Unicode text) in Python 3, even if every character in the string is ASCII, but hashlib.sha1 only works on bytes. You need to make the conversion explicit, even if the ASCII encoding is all you need:

jsonContent = json.dumps(request.json, ensure_ascii=True)
guid_object = hashlib.sha1(jsonContent.encode('ascii'))
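
Wrapped up as a helper, the Python 3 version might look like the sketch below. The function name payload_sha1 and the sample payloads are hypothetical, and sort_keys=True is an addition that is not part of the snippet above: it is there so that two payloads with the same content but a different key order hash to the same value.

```python
import hashlib
import json


def payload_sha1(payload):
    """Return the SHA-1 hex digest of a JSON-serializable object (Python 3)."""
    # sort_keys=True is an assumption of this sketch: it makes the dumped
    # text independent of the order in which the keys were inserted.
    text = json.dumps(payload, ensure_ascii=True, sort_keys=True)
    # json.dumps() returns str; hashlib needs bytes, so encode explicitly.
    return hashlib.sha1(text.encode("ascii")).hexdigest()


# Hypothetical payloads standing in for request.json.
first = payload_sha1({"src": "void", "settings": "my settings"})
second = payload_sha1({"settings": "my settings", "src": "void"})
```

Here first and second are equal 40-character hex strings. Without sort_keys, the digest would depend on the key order of the incoming JSON.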

This is why Python 3 is a better language: it forces you to be more explicit about the text you use, whether it is str (Unicode) or bytes. This avoids many, many problems down the road.

Upvotes: 1
