Lucas Ou-Yang

Reputation: 5655

Handling unicode conversion in python

For my project, everything must be in unicode. Here is how I handle it: every string is passed through this function:

def unicodify(string):
    if not isinstance(string, unicode):
        return string.decode('utf8', errors='ignore')
    return string

Is this good practice for production code? If not, why, and how would you suggest decoding to unicode? Also, errors='ignore' does not prevent the ValueError 'invalid \x escape', and I'm not sure how to handle that properly.

Thanks

Upvotes: 1

Views: 240

Answers (2)

Heikki Toivonen

Reputation: 31130

Before you can even attempt to convert a str to unicode, you need to know the encoding of the data in the str. UTF-8 is common, but it is not the only encoding out there.
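As a small illustration of why the encoding has to be known, the sketch below decodes the same two bytes with two different codecs. The variable names are only for illustration, and the snippet is written so that it should behave the same on Python 2.7 and Python 3.

```python
# The same bytes yield different text depending on the codec used to decode them.
raw = b"\xc3\xa9"  # the UTF-8 encoding of U+00E9 (e with acute accent)

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 followed by U+00A9

print(len(as_utf8), len(as_latin1))
```

Neither call raises an error, so a wrong guess about the encoding silently produces wrong text rather than an exception.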

Additionally, you could get str data that is not in any text encoding (such as arbitrary binary data). In that case you cannot meaningfully convert it to unicode. Or rather, you have two options: a) raise an exception, or b) convert as much as you can and ignore errors. Which one you should choose depends on the application.
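The two options could be sketched roughly as follows. The helper name to_text and its parameters are hypothetical, not something from the question, and the text_type alias is only there so that the snippet should run on both Python 2.7 and Python 3.

```python
# Hypothetical helper: the caller chooses the encoding and the error policy.
text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_text(data, encoding="utf-8", errors="strict"):
    """Return data as text, decoding byte strings with the given codec."""
    if isinstance(data, text_type):
        return data  # already text, nothing to decode
    return data.decode(encoding, errors)


raw = b"caf\xc3\xa9 \xff"  # valid UTF-8 followed by one invalid byte

# Option (b): convert as much as possible and drop the undecodable byte.
lenient = to_text(raw, errors="ignore")

# Option (a): strict decoding raises, so bad input is noticed immediately.
try:
    to_text(raw)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Defaulting to strict and opting in to 'ignore' only where data loss is acceptable is one possible design; silently dropping bytes everywhere can hide encoding bugs.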

Upvotes: 0

falsetru

Reputation: 368904

You may have an invalid string literal.

\x must be followed by exactly two hex digits (0-9, A-F, a-f).

Valid examples:

>>> '\xA9'
'\xa9'
>>> '\x00'
'\x00'
>>> '\xfF'
'\xff'

Invalid examples:

>>> '\xOO'
ValueError: invalid \x escape
>>> '\xl3'
ValueError: invalid \x escape
>>> '\x5'
ValueError: invalid \x escape

See String literals.
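Because the error comes from the literal itself, it is raised when the source is compiled, before any decode() call runs, which would explain why errors='ignore' has no effect on it. A minimal sketch, assuming the goal is simply to keep the backslash as data: it uses compile() to trigger the error in a controlled way and a raw string to avoid it. Python 2 reports a ValueError here, while Python 3 reports a SyntaxError, so both are caught.

```python
# Source text containing the invalid literal '\x5' (only one hex digit after \x).
source = "'\\x5'"

try:
    compile(source, "<example>", "eval")
    error = None
except (ValueError, SyntaxError) as exc:
    # Raised while compiling the literal; no decode() call is involved.
    error = exc

# A raw string does no escape processing, so the backslash is kept as-is.
raw = r"\x5"
print(len(raw))  # three characters: backslash, 'x', '5'
```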

Upvotes: 1
