ASHu2

Reputation: 2047

Python unicode conversion, decoded part does not recognise the encoded part

I am trying to use the Odia language in a project. When I encode an Odia string and then try to decode the result, I get an error.

b = "କାହିଁକି ଏଇଠି ଅଛୁ "
x = b.encode()
print(x)
m = x.decode()
print(m)

The corresponding output is:

b'\xe0\xac\x95\xe0\xac\xbe\xe0\xac\xb9\xe0\xac\xbf\xe0\xac\x81\xe0\xac\x95\xe0\xac\xbf \xe0\xac\x8f\xe0\xac\x87\xe0\xac\xa0\xe0\xac\xbf \xe0\xac\x85\xe0\xac\x9b\xe0\xad\x81 '
Traceback (most recent call last):
  File "x:\Pythonxx36\Egod\expeppp.py", line 9, in <module>
    print(m)
  File "C:\ProgramData\Miniconda3\envs\pygpu\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-6: character maps to <undefined>

I did not specify an encoding because I am not sure whether utf-8, utf-7 or utf-32 can encode the Odia language.
But here, codecs goes directly to cp1252.py, which should not have any relation to this (though I am not sure).

So my questions are:

  1. Why does the same encoded text give an error during decoding?
  2. What is cp1252.py?
  3. How can I create a new encoding in Python if none of the Python encodings support the Odia language?
    Resource: Odia Unicode block

(Questions 1 and 2 are the most important; 3 is optional.)

Upvotes: 0

Views: 213

Answers (2)

Mark Tolonen

Reputation: 177971

cp1252 is the default encoding for your terminal. Older versions of Python automatically encode Unicode strings to the terminal default encoding. You don't need to explicitly encode/decode, but you do need to use a terminal/IDE that supports the encoding you need for the characters being used. UTF-8 is the usual choice since it can handle all Unicode characters.

On Windows, Python versions 3.6 and greater handle Unicode better. The terminal encoding is ignored and Windows Unicode console APIs are used to write directly to the terminal window. You'll need a terminal font that supports the language to see the characters, or use an IDE that supports UTF-8:

Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = "କାହିଁକି ଏଇଠି ଅଛୁ "
>>> print(b)
କାହିଁକି ଏଇଠି ଅଛୁ
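To see which encodings are in play on a given machine, something like the following sketch may help. It only inspects values and prints ASCII, so it should run even on a terminal that cannot display Odia. The variable names are illustrative, and the values reported will differ between systems and Python versions.

```python
import locale
import sys

# Encoding Python uses by default for text files opened without encoding=...
preferred = locale.getpreferredencoding(False)

# Encoding attached to standard output; may be None if stdout was replaced.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("preferred file encoding:", preferred)
print("stdout encoding:", stdout_encoding)
```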

To write to a file, the default encoding is the value returned by locale.getpreferredencoding(False), which is going to be cp1252 for your system. Specify the encoding instead. UTF-8 works for all Unicode code points. For Python 3, use the following:

with open('out.txt','w',encoding='utf8') as f:
    f.write("କାହିଁକି ଏଇଠି ଅଛୁ ")

In Python 2, use io.open, which accepts the same syntax.

Always specify the encoding when reading or writing a file so code doesn't have to rely on a default that can change between different localized OS versions.

Many Windows applications assume the system default encoding rather than UTF-8 when reading a file, so you may want to use 'utf-8-sig' as the encoding. It writes a signature (byte order mark) at the beginning of the file, which Windows apps (e.g. Excel) recognize as a cue to read the file as UTF-8.
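As a rough illustration of what 'utf-8-sig' does, the sketch below writes the string to a temporary file and checks for the three signature bytes. The temporary path is an assumption made for the example; any writable location would do.

```python
import os
import tempfile

text = "କାହିଁକି ଏଇଠି ଅଛୁ "

# Hypothetical output location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# 'utf-8-sig' prepends the UTF-8 signature (EF BB BF) when writing.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

# Read the raw bytes back to confirm the signature is present.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with 'utf-8-sig' strips the signature again.
with open(path, encoding="utf-8-sig") as f:
    round_trip = f.read()

print(has_bom, round_trip == text)
```

Reading the same file with plain 'utf-8' would leave the signature in the text as a leading U+FEFF character, which is why the reader uses 'utf-8-sig' as well.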

Upvotes: 1

Ned Batchelder

Reputation: 375814

Your error is not during decoding. It's when you try to print. m is a Unicode string, successfully decoded from x. But when printing, Python tries to encode the string again to the encoding needed by your terminal. That encoding is cp1252, a Windows one-byte encoding. That encoding cannot handle Odia, so it fails.
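A small sketch that separates the two steps: the UTF-8 round trip works, while encoding the same string to cp1252 (roughly what print has to do for such a terminal) fails. It avoids printing the Odia text itself so it can run on any terminal.

```python
text = "କାହିଁକି ଏଇଠି ଅଛୁ "

# Step 1: encode and decode with UTF-8. Both directions succeed.
data = text.encode("utf-8")
decoded = data.decode("utf-8")
round_trip_ok = decoded == text

# Step 2: encode to cp1252, as a cp1252 terminal would require.
try:
    text.encode("cp1252")
    cp1252_ok = True
    error_message = ""
except UnicodeEncodeError as exc:
    cp1252_ok = False
    error_message = str(exc)

print("UTF-8 round trip ok:", round_trip_ok)
print("cp1252 encode ok:", cp1252_ok)
print(error_message)
```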

For question 3, you cannot easily create a new encoding. You need to set your terminal to use an encoding that can handle Odia, like UTF-8.
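A new encoding should not be needed in any case, since the standard Unicode encodings already cover Odia. The sketch below checks that the string from the question survives a round trip through three of them; the `results` dictionary is just an illustrative name.

```python
text = "କାହିଁକି ଏଇଠି ଅଛୁ "

# Encode and decode with each codec and record whether the text survives.
results = {}
for name in ("utf-8", "utf-16", "utf-32"):
    results[name] = text.encode(name).decode(name) == text

print(results)
```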

Upvotes: 3
