Losing data when encoding into utf8

Question

I am using Python and receiving data about a user through the Twitter API. What I get is a JSON-encoded dict of attributes. For example:

{
    "id": 123456789,
    "name": "ととり～む",
    "screen_name": "somescreenname",
    "description": "こんにちは♪ キャラ的にはこなたですが好きな子はつかさな私です。 ゲーム・漫画・アニメならなんでも好きです。 気が合う方はよろしくお願いします。 "
}

Note: This is not the exact dict that I receive, but summarized for the question.

Some of my users have their data in another language, presumably Japanese. I want to save the "name" of my user. When I use:

data["name"].encode('utf8')

I still seem to lose some of the characters, like this: ￣ﾁﾨ￣ﾁﾨ￣ﾂﾊ￯ﾽﾞ￣ﾂﾀ. I don't want to lose any data. What is the best mechanism I can apply here?

lvc · Accepted Answer

I think you'll find you're not actually losing any data. You should be able to do:

data['name'].encode('utf8').decode('utf8')

and get back the original string. You can write the intermediate bytes object to disk, and read it back in later and decode it to the same effect.
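To check this yourself, here is a minimal sketch of the round trip, assuming Python 3 (where `encode` returns a `bytes` object). The sample name is taken from the question; the temporary file name `name.bin` is only a placeholder for illustration.

```python
import os
import tempfile

name = "ととり～む"  # sample value from the question

encoded = name.encode("utf8")           # str -> bytes
round_tripped = encoded.decode("utf8")  # bytes -> str

# Write the raw bytes to a temporary file (hypothetical name), then read
# them back and decode with the same encoding.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "name.bin")
    with open(path, "wb") as f:
        f.write(encoded)
    with open(path, "rb") as f:
        restored = f.read().decode("utf8")

# Both comparisons should be True if nothing was lost.
print(round_tripped == name, restored == name)
```

If both comparisons come out true, the encoded bytes carry the full string and nothing has been lost.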

What you seem to be worried about is the squares and other nonsense that come up when you print the encoded string. This is almost certainly a display issue rather than data loss. Probably your terminal is interpreting the bytes in a different encoding, resulting in mojibake. As long as you are careful to keep your encodings straight in your program, this won't cause you issues - just check that the encode-then-decode round trip gives back the original string.
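As a rough illustration of how mojibake arises, the sketch below decodes UTF-8 bytes with the wrong codec on purpose. Latin-1 is just an assumed stand-in for whatever encoding your terminal actually uses, so the garbled text it produces will not necessarily match what you saw.

```python
name = "ととり～む"  # sample value from the question
utf8_bytes = name.encode("utf8")

# Deliberately misread the UTF-8 bytes as Latin-1 (an assumed stand-in for
# a misconfigured terminal). The result looks like nonsense.
garbled = utf8_bytes.decode("latin-1")

# The underlying bytes were never changed, so reversing the wrong decode
# and applying the right one recovers the original text.
recovered = garbled.encode("latin-1").decode("utf8")

print(garbled != name, recovered == name)
```

The garbled string is only a wrong view of the same bytes, which is why the original can still be recovered from it.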
