kdwftw

Reputation: 15

Encoding issues related to Python and foreign languages

Here's a problem I am facing with encoding and decoding text.

I am trying to write code that finds a string (or a byte sequence) in a file and returns the path of that file.

My detection function reports the files I am opening as 'windows-1252' (also known as 'cp-1252'), so I have been trying to: 1. encode my string into bytes using the encoding of the file, 2. search the file for those bytes and get the path of the matching file.

I have a file, say 'f', that is reported as 'windows-1252' or 'cp-1252'. It contains some Chinese text: '[跑Online農場]'

with open(os.path.join(root, filename), mode='rb') as f:
    text = f.read()
    print(encoding(text)) # encoding() is a separate function that I wrote that returns the encoding of the file
    print(text)
Windows-1252
b'\x00StaticText\x00\x00\x12\x00[\xb6]Online\xb9A\xb3\xf5]\x00\x01\x00\x ...

As you can see, the bytes for [跑Online農場] are [\xb6]Online\xb9A\xb3\xf5]

However, the funny thing is that if I encode the string directly with that codec, I get:

enter_text = '[跑Online農場]'
print(bytes(enter_text, 'cp1252'))
UnicodeEncodeError: 'charmap' codec can't encode character '\u8dd1' in position 1: character maps to <undefined>

On the other hand, opening the file using

with open(os.path.join(root, filename), mode='r', encoding='cp-1252') as f ...

I get:

StaticText   [¶]Online¹A³õ]   €?‹  Œ  î...

I am not sure how I would 'translate' '[跑Online農場]' into '[¶]Online¹A³õ]' either. An answer to this might also solve the problem.

What should I do to correctly encode the Chinese (or other foreign) characters so that the result matches the bytes that Python returns in 'rb' mode?

Thank you!

Upvotes: 1

Views: 1111

Answers (1)

lenz

Reputation: 5817

Your encoding function is wrong: the codec of the file is probably CP950, but certainly not CP1252.

Note: guessing the encoding of a given byte string is always approximate. There's no safe way of determining the encoding for sure.

If you have a byte string like

b'[\xb6]Online\xb9A\xb3\xf5]'

and you know it must translate (be decoded) into

'[跑Online農場]'

then what you can do is trial and error with a few codecs.

I did this with the list of codecs supported by Python, searching for codecs for Chinese.
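That trial and error can be sketched roughly as below. The candidate list is just my own shortlist of Chinese-related codecs from Python's standard encodings (plus CP-1252 for comparison), and matching_codecs is a helper name I made up for this illustration:

```python
raw = b'[\xb6]Online\xb9A\xb3\xf5]'
expected = '[跑Online農場]'

# Assumed shortlist of codecs to try; extend it as needed.
candidates = ['cp1252', 'gb2312', 'gbk', 'gb18030',
              'big5', 'cp950', 'big5hkscs', 'hz']


def matching_codecs(raw, expected, candidates):
    """Return the codec names that decode `raw` into exactly `expected`."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec at all.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


print(matching_codecs(raw, expected, candidates))
```

More than one codec may match a short sample like this, which is another reason to treat the result as a good guess rather than proof.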

When using CP-1252 (the Windows version of Latin-1), as you did, you get mojibake:

>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp1252')
'[¶]Online¹A³õ]'

When using CP-950 (the Windows codepage for Traditional Chinese), you get the expected output:

>>> b'[\xb6]Online\xb9A\xb3\xf5]'.decode('cp950')
'[跑Online農場]'

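So: use CP-950 for reading the file. For the same reason, encode your search string with 'cp950' rather than 'cp1252' to get bytes that match what the 'rb' read returns.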
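For the original goal of finding the files that contain the text, a minimal sketch could look like this. It assumes every file under the directory uses the same codec (CP-950 here), and find_files_containing is a hypothetical name, not something from your code:

```python
import os


def find_files_containing(root_dir, text, encoding='cp950'):
    """Return paths of files under root_dir whose raw bytes contain `text`.

    Assumption: the files are all encoded with `encoding`.
    """
    needle = text.encode(encoding)  # e.g. b'[\xb6]Online\xb9A\xb3\xf5]'
    hits = []
    for root, _dirs, files in os.walk(root_dir):
        for filename in files:
            path = os.path.join(root, filename)
            try:
                with open(path, mode='rb') as f:
                    if needle in f.read():
                        hits.append(path)
            except OSError:
                # Skip files that cannot be opened or read.
                continue
    return hits
```

You would call it as find_files_containing(some_directory, '[跑Online農場]'). It reads each file fully into memory, which is fine for small files but may need chunked reading for large ones.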

Upvotes: 1
