Sasha Trubetskoy

Reputation: 3

How to decode a Cyrillic Windows-1251 string to Unicode using Python

I have a very large (2.5 GB) text file with Cyrillic characters in various encodings, including Windows-1251:

=D0=A0=D0=B2=D0=B8=D1=81=D1=8C =D0=B2 =D0=B0=D1=82=D0=B0=D0=BA=D1=83 =D0=BD= =D0=B0 =C2=AB=D0=9F=D0=B5=D1=80=D1=88=D0=B8=D0=BD=D0=B3=D0=B5=C2=BB

I have already tried .encode() and .decode() with various combinations of encodings, but I cannot get the text to be readable. I have also tried reading in binary mode.

with open('myfile.mbox', 'r') as f:
    unreadable_str = f.readline()

unreadable_str.encode('WINDOWS-1251').decode('utf-8') 

I thought it would encode the string into bytes using the Windows encoding and then give it back as readable Unicode, but instead, it always outputs the same string.

Upvotes: 0

Views: 2774

Answers (1)

Mark Tolonen

Reputation: 177386

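That data is in the MIME quoted-printable encoding (defined in RFC 2045; RFC 1522 covers the related encoding used in message headers). The quopri module can decode it to bytes, which here turn out to be UTF-8-encoded data rather than Windows-1251: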

>>> import quopri
>>> s='''=D0=A0=D0=B2=D0=B8=D1=81=D1=8C =D0=B2 =D0=B0=D1=82=D0=B0=D0=BA=D1=83 =D0=BD= =D0=B0 =C2=AB=D0=9F=D0=B5=D1=80=D1=88=D0=B8=D0=BD=D0=B3=D0=B5=C2=BB'''
>>> quopri.decodestring(s).decode('utf8')
'Рвись в атаку н= а «Першинге»'
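As a minimal sketch of why the .encode()/.decode() attempt in the question changes nothing, and of how the decoding could be wrapped for reuse: the raw text contains only ASCII characters, so any ASCII-compatible round trip returns it unchanged. The helper name decode_qp and its charset default are hypothetical, not part of any library, and UTF-8 is assumed only because the sample decodes cleanly with it.

```python
import quopri


def decode_qp(raw: bytes, charset: str = "utf-8") -> str:
    """Decode quoted-printable bytes, then decode the result as text.

    Hypothetical helper; the charset default is an assumption based on
    the sample in the question.
    """
    return quopri.decodestring(raw).decode(charset)


sample = "=D0=A0=D0=B2=D0=B8=D1=81=D1=8C =D0=B2 =D0=B0=D1=82=D0=B0=D0=BA=D1=83"

# The sample is pure ASCII, so encoding it as Windows-1251 and decoding
# it as UTF-8 gives back exactly the same string.
assert sample.encode("windows-1251").decode("utf-8") == sample

print(decode_qp(sample.encode("ascii")))  # Рвись в атаку

# A real soft line break ("=" immediately followed by a newline) is
# removed during decoding, joining the two halves of the word.
print(decode_qp(b"=D0=BD=\n=D0=B0"))  # на
```

The stray "= " in the output above is probably a soft line break whose newline was turned into a space when the sample was pasted. Reading the file in binary mode should keep those line breaks intact. Since the file is an mbox, the standard-library mailbox and email modules may be a better fit for the whole job, because they read each part's Content-Transfer-Encoding and charset instead of assuming one.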

Upvotes: 4
