Reputation: 73
I fetched the subject of an email message using Python modules and received the string
'=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?='
I know the string is encoded in UTF-8. Python strings have a decode method for this, but to use it I had to replace each = sign with \x. After doing that replacement by hand and printing the decoded result, I get the string سلام_کجائی which is exactly what I want. The question is how to do the replacement automatically. It seems harder than a simple call to a string function such as replace.
Below is the code I used after the manual replacement:
r='\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85_\xDA\xA9\xD8\xAC\xD8\xA7\xD8\xA6\xDB\x8C'
print r.decode('utf-8')
I would appreciate any workable idea.
Upvotes: 7
Views: 9666
Reputation: 14328
For Python 3, to decode a string of \x escapes like this, write it with the b prefix so that it is a bytes literal:
>>> b"\xe4\xb8\x8b\xe4\xb8\x80\xe6\xad\xa5".decode("utf-8")
'下一步'
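The literal above is typed by hand. As a rough sketch of doing the same substitution automatically in Python 3, each =XX pair can be turned into the byte it names with a regular expression. The helper name qp_to_bytes is made up for illustration, and the sketch ignores quoted-printable details such as soft line breaks, so the standard library decoders are probably the safer choice for real mail.

```python
import re

def qp_to_bytes(text):
    """Replace every =XX hex pair in an ASCII string with the byte it names."""
    data = text.encode("ascii")
    return re.sub(
        rb"=([0-9A-Fa-f]{2})",
        # m.group(1) is the two hex digits as bytes, e.g. b"D8"
        lambda m: bytes.fromhex(m.group(1).decode("ascii")),
        data,
    )

# The subject from the question, without the trailing "?="
raw = "=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C"
subject = qp_to_bytes(raw).decode("utf-8")
```

Printing subject should show سلام_کجائی, the same text the manual replacement produced.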
Upvotes: 0
Reputation: 62928
Just decode it from quoted-printable to get a UTF-8-encoded byte string:
In [35]: s = '=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?='
In [36]: s.decode('quoted-printable')
Out[36]: '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85_\xda\xa9\xd8\xac\xd8\xa7\xd8\xa6\xdb\x8c?'
Then, if needed, from utf-8 to unicode:
In [37]: s.decode('quoted-printable').decode('utf8')
Out[37]: u'\u0633\u0644\u0627\u0645_\u06a9\u062c\u0627\u0626\u06cc?'
In [39]: print s.decode('quoted-printable')
سلام_کجائی?
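The session above is Python 2, where str has a decode method. In Python 3 that method no longer exists on str, but the same two steps can be done on a bytes object. A minimal sketch, assuming a recent Python 3 and the codec alias "quopri" for the quoted-printable bytes-to-bytes codec:

```python
import codecs

s = b"=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?="

# Step 1: quoted-printable -> raw UTF-8 bytes (a bytes-to-bytes transform)
utf8_bytes = codecs.decode(s, "quopri")

# Step 2: UTF-8 bytes -> str
text = utf8_bytes.decode("utf-8")
```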
Upvotes: 8
Reputation: 5919
This sort of encoding is known as quoted-printable. Python's standard library has a module, quopri, for encoding and decoding it.
You're right that it's just a pure quoting of binary strings, so you need to apply UTF-8 decoding afterwards. (Assuming the string is in UTF-8, of course. But that looks correct although I don't know the language.)
import quopri
print(quopri.decodestring('=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?=').decode('utf-8'))
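The trailing ?= in the question's string suggests it is the tail of an RFC 2047 encoded-word, the form used for non-ASCII mail headers. If so, the standard email.header module can decode the whole header value in one step. A sketch for Python 3, assuming the full Subject value looked like =?utf-8?Q?...?= (the =?utf-8?Q? prefix is a guess, since the question does not show it); decode_subject is a hypothetical helper name:

```python
from email.header import decode_header

def decode_subject(raw):
    """Decode an RFC 2047 header value into a plain str."""
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            # Encoded-words come back as bytes plus the declared charset
            parts.append(payload.decode(charset or "ascii"))
        else:
            # A header with no encoded-words comes back as str
            parts.append(payload)
    return "".join(parts)

# Assumed full header value; only the part after "?Q?" appears in the question
raw = "=?utf-8?Q?=D8=B3=D9=84=D8=A7=D9=85_=DA=A9=D8=AC=D8=A7=D8=A6=DB=8C?="
subject = decode_subject(raw)
```

Under that assumption the underscore comes out as a space, because in header Q-encoding _ stands for a space, so the result is two words rather than سلام_کجائی.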
Upvotes: 4