Reputation: 101
I have some problems regarding escape characters.
Problem I:
I have a string in the form of:
String = "%C3%85"
String is the representation of the two UTF-8 bytes that encode the character "Å", except that "\x" is replaced with "%".
So I want to alter String to look like this:
String = "\xC3\x85"
Problem II:
I have a String in the form:
String = "\\x33"
Now I want to convert it into the corresponding UTF-8 byte representation, which should look like:
String = b"\x33"
How do I do that?
Approaches I tried:
I tried using the replace method:
string.replace("%","\") -- won't work since \ escapes "
string.replace("%","\\") -- won't work since this produces problem II
string.replace("%","\x00").replace("00","") -- won't work since "\x00" is a char of its own.
bytes(string.replace("%","\\") ) -- won't work since this basically comes down to problem II
One approach that works but is way more work than seems to be needed is to create a dictionary with all characters in the form of:
"%00" = "\x00"
...
...
But this should be automatable, since it's basically just replacing % with \x.
I am out of luck and couldn't find any help anywhere on the internet.
lmgtfy won't help me either ;)
Thanks for any help!
Upvotes: 2
Views: 5252
Reputation: 5817
Both problems can probably be solved with the standard library.
Problem I looks like URL encoding (percent-encoding), i.e. the kind of "garbling" you see in query strings in the browser's address bar.
In Python 3, the urllib.parse module can handle this:
>>> import urllib.parse
>>> urllib.parse.unquote('%C3%85')
'Å'
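If you want the raw bytes rather than the decoded character, urllib.parse also offers unquote_to_bytes(). A minimal sketch, assuming the input is the percent-encoded string from the question (the variable names are only for illustration):

```python
from urllib.parse import unquote, unquote_to_bytes

encoded = "%C3%85"

# Percent-decode straight to bytes, without interpreting them as text.
raw = unquote_to_bytes(encoded)

# Percent-decode and interpret the bytes as UTF-8 (the default encoding).
text = unquote(encoded)

print(raw)   # b'\xc3\x85'
print(text)  # Å
```

Decoding raw as UTF-8 gives the same string as unquote().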
For Problem II, you seem to have escape sequences as they are used in Python's string literals.
As you might know, you can type 'å' or '\xe5' in the source code to get exactly the same string, just as you can type 0.1, .1 or 1e-1 to get the same float value.
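A quick check of that equivalence, and of why the string from Problem II behaves differently. This is only an illustrative sketch; the variable names are made up:

```python
# An escape sequence in a literal is resolved when the source is parsed,
# so both spellings produce the same one-character string.
literal = '\x33'
same_string = (literal == '3')

# The float spellings are likewise just different notations for one value.
same_float = (0.1 == .1 == 1e-1)

# With a doubled backslash, the escape is NOT resolved: the string really
# contains the four characters \, x, 3 and 3.
escaped = '\\x33'

print(same_string, same_float)      # True True
print(len(literal), len(escaped))   # 1 4
```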
Since the Python interpreter sees the four characters \, x, e and 5 in your source code, it must have a way to convert this sequence into the character å. And (part of) this algorithm is made available to Python programmers through the "unicode_escape" codec, which you can use like "normal" codecs such as "utf-8":
>>> '\\x33'.encode('ascii').decode('unicode_escape')
'3'
Since Python 3's str type has no decode() method, you have to encode it to bytes first.
If your input contains ASCII characters only, the above line works; "latin-1" is also possible for a mixture of Latin-1 characters and \xNN escapes.
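The question asks for a bytes object rather than a str. One way to get there is to encode the result back with "latin-1", which maps code points 0 to 255 one-to-one onto byte values. A minimal sketch, assuming the input only contains \xNN escapes and Latin-1 characters (unescape_to_bytes is a hypothetical helper name, not a library function):

```python
def unescape_to_bytes(escaped: str) -> bytes:
    """Turn a string with literal \\xNN escapes into the bytes they denote.

    Assumption: every character after unescaping has a code point below 256.
    Otherwise the final encode() raises UnicodeEncodeError.
    """
    unescaped = escaped.encode("latin-1").decode("unicode_escape")
    return unescaped.encode("latin-1")


print(unescape_to_bytes("\\x33"))        # b'3'
print(unescape_to_bytes("\\xC3\\x85"))   # b'\xc3\x85'
```

The second result can then be decoded as UTF-8 if the bytes are meant to be text.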
Upvotes: 1
Reputation: 36658
The problem is that you have a string representation of a hex-encoded byte array. You need to parse each two-digit hex value into an integer, collect the integers into a byte array, and then decode those bytes as UTF-8. Try this:
import re
String = "%C3%85"
# Match exactly two hex digits after each %, so int(c, 16) cannot fail
out = bytearray(int(c, 16) for c in re.findall(r'%([0-9A-Fa-f]{2})', String)).decode('utf8')
out
# returns:
'Å'
For your second part, the binary representation of '\x33' is b'3'. To get from the string '\\x33' to b'3', you again need to strip out the string formatting, parse the hex digits into integers, and convert those to bytes.
String = '\\x33'
out = bytes(int(c, 16) for c in re.findall(r'\\x([0-9A-Fa-f]{2})', String))
out
# returns:
b'3'
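Note that re.findall() keeps only the escaped bytes and drops any other text in the string. If the input can mix plain characters with escapes, a substitution that leaves the rest untouched may be a better fit. A rough sketch, assuming the percent-escaped bytes form valid UTF-8 (percent_decode is a hypothetical helper name):

```python
import re

# Bytes pattern: a % followed by exactly two hex digits.
_PERCENT = re.compile(rb"%([0-9A-Fa-f]{2})")


def percent_decode(text: str) -> str:
    """Replace each %NN with the byte it denotes, keeping all other text."""
    raw = text.encode("utf-8")
    decoded = _PERCENT.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")),
        raw,
    )
    return decoded.decode("utf-8")


print(percent_decode("%C3%85"))     # Å
print(percent_decode("a%C3%85b"))   # aÅb
```

Working on bytes rather than str means a multi-byte character such as Å is reassembled before the final UTF-8 decode.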
Upvotes: 0