Reputation: 1329
Given an arbitrary "string" from a library I do not have control over, I want to make sure the "string" is a unicode type and encoded in utf-8. I would like to know if this is the best way to do this:
import types
input = <some value from a lib I dont have control over>
if isinstance(input, types.StringType):
    input = input.decode("utf-8")
elif isinstance(input, types.UnicodeType):
    input = input.encode("utf-8").decode("utf-8")
In my actual code I wrap this in a try/except and handle the errors but I left that part out.
Upvotes: 8
Views: 3250
Reputation: 177891
I think you have a misunderstanding of Unicode and encodings. Unicode characters are just numbers. Encodings are the representation of the numbers. Think of Unicode characters as a concept like fifteen, and encodings as 15, 1111, F, XV. You have to know the encoding (decimal, binary, hexadecimal, roman numerals) before you can decode an encoding and "know" the Unicode value.
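To make the analogy concrete, here is a small sketch (the variable names are only illustrative). One character, U+00E9, becomes different bytes under different encodings, and the same bytes read back as different text if you assume the wrong encoding:

```python
# One Unicode character, identified by its code point (a number).
char = u"\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Three different byte representations of that same character.
as_utf8 = char.encode("utf-8")       # two bytes: C3 A9
as_latin1 = char.encode("latin-1")   # one byte: E9
as_utf16 = char.encode("utf-16-le")  # two bytes: E9 00

# Decoding with the right encoding recovers the character.
assert as_utf8.decode("utf-8") == char

# Decoding with the wrong encoding does not fail here; it silently
# produces two unrelated characters (U+00C3 and U+00A9) instead.
misread = as_utf8.decode("latin-1")
```

This is why the encoding has to be known before the bytes can be interpreted.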
If you have no control over the input string, it is difficult to convert it to anything. For example, if the input was read from a file you'd have to know the encoding of the text file to decode it meaningfully to Unicode, and then encode it into 'UTF-8' for your C++ library.
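As a rough sketch of that decode-then-encode step, assuming you do know the source encoding (the helper name transcode_to_utf8 and the sample bytes are made up for illustration):

```python
def transcode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known encoding, then re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)  # bytes -> Unicode
    return text.encode("utf-8")               # Unicode -> UTF-8 bytes

# "caf" followed by byte E9, which is an accented e in Latin-1.
latin1_bytes = b"caf\xe9"
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
```

If the wrong source encoding is passed, the call may raise UnicodeDecodeError or, worse, succeed and return the wrong text.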
Upvotes: 2
Reputation: 2260
Are you sure you want a UTF-8 encoded sequence stored in a Unicode type? Normally, Python stores characters in a types.UnicodeType using UCS-2 or UCS-4, sometimes referred to as "wide" characters, which should be capable of containing characters from all reasonably common scripts.
One wonders what sort of lib this is that sometimes outputs types.StringType and sometimes types.UnicodeType. If I had to take a wild guess, the lib always produces types.StringType, but doesn't tell you which encoding it is in. If that is the case, you are actually looking for code that can guess which charset a types.StringType is encoded in.
In most cases, this is easy as you can assume that it is either in e.g. latin-1 or UTF-8. If the text can actually be in any odd encoding (e.g. incoming mail w/o proper header) you need a lib that guesses encoding. See http://chardet.feedparser.org/.
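For the easy case, a minimal sketch could look like the following. It assumes the data is either UTF-8 or latin-1 and nothing else; guess_decode is a hypothetical helper name, not part of any library:

```python
def guess_decode(raw_bytes):
    """Try UTF-8 first; fall back to latin-1 if the bytes are not valid UTF-8."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every possible byte to a code point, so this
        # never fails, but the result is only correct if the data
        # really was latin-1.
        return raw_bytes.decode("latin-1")

from_utf8 = guess_decode(b"caf\xc3\xa9")  # valid UTF-8
from_latin1 = guess_decode(b"caf\xe9")    # not valid UTF-8, falls back
```

Trying UTF-8 first matters: non-ASCII latin-1 text is rarely valid UTF-8 by accident, whereas any byte sequence decodes as latin-1.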
Upvotes: 0
Reputation: 10958
A Unicode object is not encoded (it is internally but this should be transparent to you as a Python user). The line input.encode("utf-8").decode("utf-8") does not make much sense: you get the exact same sequence of Unicode characters at the end that you had in the beginning.
if isinstance(input, str):
    input = input.decode('utf-8')
is all you need to ensure that str objects (byte strings) are converted into Unicode strings.
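Wrapped up as a reusable function, the idea might look like the sketch below. The name to_unicode is made up, and the code is written to run on both Python 2 and Python 3, which is why it tests against bytes (an alias of str on Python 2) rather than str:

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3

def to_unicode(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value  # already Unicode, nothing to do
    raise TypeError("expected a byte string or a Unicode string")

# The round trip from the question leaves a Unicode string unchanged.
text = u"caf\u00e9"
round_tripped = text.encode("utf-8").decode("utf-8")
```

A UnicodeDecodeError from the decode call means the bytes were not valid UTF-8, which is the case the try/except in the question would need to handle.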
Upvotes: 6
Reputation: 34718
Simply:
try:
    input = unicode(input.encode('utf-8'))
except ValueError:
    pass
It's always better to seek forgiveness than to ask permission.
Upvotes: 2