peterretief

Reputation: 2067

When to use unicode(string) and string.encode('utf-8') in python

I had some odd characters coming through with spreadsheet cell data. I tried to resolve it with encode('utf-8'), as was suggested. That didn't resolve the problem, but when I used unicode(string) it worked. My question: is there a standard way to deal with all types of text data?

Upvotes: 1

Views: 181

Answers (1)

georg

Reputation: 214969

To put it very basically, a "string" ("unicode string" in Python 2 and just "string" in Python 3) is a sequence of "characters". But a "character" is an abstraction: there's no way to store a character in a file system or send it over a network (sounds weird, but there really isn't). File systems, networks, consoles and other devices only understand "bytes". Therefore, it's your job as a programmer to correctly translate characters to bytes and vice versa when you talk to a device or an external program.

Character-to-byte translation is called "encoding" and is done with "encode()" in Python. When you send a string to a device, you "encode()" your characters to bytes:

some_chunk_of_bytes = some_string.encode(how_exactly)

There are many ways (called "character encodings") to represent a character as a combination of bytes, so you have to tell the encoder exactly which one you want it to use.
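As a small illustration of why the encoding has to be named, the sketch below encodes the same four-character string three ways and gets three different byte sequences. It uses Python 3 syntax (in Python 2 the literal would need a u prefix), and the sample word is just an assumption chosen because it contains one non-ASCII character.

```python
# Python 3 syntax. "caf\u00e9" is the word "café"; the escape keeps the
# example independent of the source file's own encoding.
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")        # é becomes two bytes: \xc3\xa9
latin1_bytes = text.encode("latin-1")    # é becomes one byte:  \xe9
utf16_bytes = text.encode("utf-16-le")   # every character becomes two bytes

# Same characters, different byte counts depending on the encoding.
print(len(text), len(utf8_bytes), len(latin1_bytes), len(utf16_bytes))
```

The character count stays at 4 in every case; only the byte representation changes.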

When you read the data from somewhere, you only get raw bytes and have to "decode()" them to meaningful characters:

some_string = some_chunk_of_bytes.decode(how_exactly)

Again, you have to specify how you think these bytes are encoded (there's no way to tell for sure).
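To make the "no way to tell for sure" point concrete, here is a sketch (Python 3 syntax, with made-up sample bytes) that decodes the same bytes with a correct guess, a wrong guess that silently produces garbage, and a wrong guess that fails loudly.

```python
# Bytes as they might arrive from a file or a socket (sample data).
raw = b"caf\xc3\xa9"

# Correct guess: the bytes really are UTF-8.
good = raw.decode("utf-8")

# Wrong guess that still "works": latin-1 maps every byte to some character,
# so there is no error, just two wrong characters where one was expected.
wrong = raw.decode("latin-1")

# Wrong guess that fails loudly: ASCII has no meaning for bytes above 0x7f.
failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    failed = True

print(repr(good), repr(wrong), failed)
```

The silent case is the dangerous one: the garbled text only shows up later, far from the place where the wrong encoding was chosen.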

There are a number of shortcuts in Python that hide this encode/decode step from you. For example, in Python 2,

string = unicode(bytes)

does this behind the scenes:

string = bytes.decode(sys.getdefaultencoding())  # 'ascii' unless changed

and when you do something as simple as

print string

it's actually:

sys.stdout.write(string.encode(sys.stdout.encoding))  # roughly; assumes stdout reports an encoding
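The same hidden step can be observed in Python 3, where text streams encode on write. The sketch below is only an illustration: it wraps an in-memory byte buffer (standing in for a real device) in io.TextIOWrapper, writes a string, and then looks at the bytes that actually reached the buffer.

```python
import io

# A byte sink standing in for a console, file, or socket.
raw_sink = io.BytesIO()

# The text layer encodes for us; newline="\n" disables newline translation
# so the result is the same on every platform.
stream = io.TextIOWrapper(raw_sink, encoding="utf-8", newline="\n")

stream.write("caf\u00e9\n")  # we hand over characters...
stream.flush()

written = raw_sink.getvalue()  # ...but the sink only ever sees bytes
print(written)
```

Nothing in the write call mentions an encoding, yet one was applied; it was fixed when the stream was created.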

But even if you don't use encode/decode explicitly, you have to realize it still must take place at some point. If you get garbled characters in your program, it's always because you:

  • forgot the "encode" step, or
  • forgot the "decode" step, or
  • provided an incorrect "encoding"
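One common way to avoid all three mistakes is to decode once when data enters the program, work only with text inside it, and encode once when data leaves. The sketch below shows that pattern in Python 3 syntax; to_text is a hypothetical helper written for this example, not a standard library function, and assuming UTF-8 for the incoming bytes is exactly that, an assumption about the data.

```python
import io

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes to text, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Mixed input, as might come from spreadsheet cells (sample data).
cells = [b"na\xc3\xafve", "plain", b"caf\xc3\xa9"]

# Input boundary: everything becomes text exactly once.
texts = [to_text(cell) for cell in cells]

# Inside the program: ordinary string operations on characters.
line = ", ".join(texts)

# Output boundary: encode exactly once, with an explicit encoding.
sink = io.BytesIO()
sink.write(line.encode("utf-8"))
output = sink.getvalue()
print(output)
```

If the source is not UTF-8, the only line that has to change is the encoding passed at the input boundary.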

As I said, this description is very basic. If you want to understand all the details, please read

Upvotes: 3
