Anuj
Anuj

Reputation: 9622

character encoding in python

I have a byte stream that looks like this '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

str_data is wrote into text file using the following code

file = open("test_doc","w")
file.write(str_data)
file.close()

If test_doc is opened in a web browser and character encoding is set to Japanese it just works fine.

I am using reportlab for generating pdf . using the following code

from reportlab.pdfbase import pdfmetrics
from reportlab.pdfgen.canvas import Canvas
from reportlab.pdfbase.cidfonts import CIDFont


pdfmetrics.registerFont(CIDFont('HeiseiMin-W3','90ms-RKSJ-H'))
pdfmetrics.registerFont(CIDFont('HeiseiKakuGo-W5','90ms-RKSJ-H'))
c = Canvas('test1.pdf')
c.setFont('HeiseiMin-W3-90ms-RKSJ-H', 6)

message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'

message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88';

c.drawString(100, 675,message1)
c.save()

Here I use message1 variable which gives output in Japanese I need to use message3 instead of message1 to generate the pdf. message3 generated garabage probably because of improper encoding.

Upvotes: 1

Views: 2756

Answers (3)

dawg
dawg

Reputation: 103884

If you need to detect these encodings on the fly, you can take a look at Mark Pilgrim's excellent open source Universal Encoding Detector.

#!/usr/bin/env python

import chardet 
message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
print chardet.detect(message1)
message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
print chardet.detect(message3)
str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
print chardet.detect(str_data)

Output:

{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
{'confidence': 0.87625, 'encoding': 'utf-8'}
{'confidence': 0.87625, 'encoding': 'utf-8'}

Upvotes: 1

John Machin
John Machin

Reputation: 82934

Here is an answer:

message1 is encoded in shift_jis; message3 and str_data are encoded in UTF-8. All appear to represent Japanese text. See the following IDLE session:

>>> message1 = '\202\261\202\352\202\315\225\275\220\254\226\276\222\251\202\305\202\267\201B'
>>> print message1.decode('shift_jis')
これは平成明朝です。
>>> message3 = '\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
>>> print message3.decode('UTF-8')
テスト
>>>str_data = '\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'
>>> print str_data.decode('UTF-8')
日本語
>>> 

Google Translate detects the language as Japanese and translates them to the English "This is the Heisei Mincho.", "Test", and "Japanese" respectively.

What is the question?

Upvotes: 2

Achim
Achim

Reputation: 15712

I guess you have to learn more about encoding of strings in general. A string in python has no encoding information attached, so it's up to you to use it in the right way or convert it appropriately. Have a look at unicode strings, the encode / decode methods and the codecs module. And check whether c.drawString might also allow to pass a unicode string, which might make your live much easier.

Upvotes: 0

Related Questions