J.P. Hutchins

Reputation: 33

Add a non-escaped escape character to a Python bytearray

I have an API that requires the quotation marks in my XML attributes to be escaped, so <cmd id="1"> will not work; it requires <cmd id=\"1\">.

I have tried iterating through my string, for example:

b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'

Each time I encounter a " (ASCII 34), I replace it with an escape character (ASCII 92) followed by the quote. Infuriatingly, this results in:

b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id=\\"1\\">SetChLevel</cmd><name>C</name><value>30</value></tx>'

where the escapes have been escaped. As a sanity check I replaced 92 with any other character and it works as expected.

temp = b'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<tx><cmd id="1">\
SetChLevel</cmd><name>C</name><value>30</value></tx>'

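i = 0
# Reserve one extra byte for every quote that gets a backslash in front of it.
payload = bytearray(len(temp) + temp.count(b'"'))

for char in temp:  # iterating over bytes yields integers
    if char == 34:  # 34 is ASCII for "
        payload[i] = 92  # 92 is ASCII for a backslash
        i += 1
    payload[i] = char
    i += 1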

print(bytes(payload))

I would assume that character 92 would appear once but something is escaping the escape!

Upvotes: 0

Views: 1293

Answers (1)

Grismar

Reputation: 31354

Your problem is the result of a very common misunderstanding for programmers new to Python.

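When Python shows the representation of a string (or bytes), as it does when you evaluate a value in the interactive interpreter or print a bytes object, it escapes the escape character (\) so that the text shown, when used in Python as a literal, would give you the exact same value.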

So:

s = 'abc\\abc'
print(s)

This prints abc\abc, but in the interactive interpreter you get:

>>> s = 'abc\\abc'
>>> print(s)
abc\abc
>>> s
'abc\\abc'

Note that this is correct. After all print(s) should show the string on the console as it is, while s on the interpreter is asking Python to show you the representation of s, which includes the quotes and the escape characters.

Compare:

>>> repr(s)
"'abc\\\\abc'"

Here the interpreter shows the representation of the string returned by repr(s), i.e. the representation of the representation of s.
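As a small sketch of the distinction above, the same value can be inspected outside the interactive prompt. Counting the characters makes it clear that the value holds a single backslash, however many the representation shows:

```python
s = 'abc\\abc'            # the literal has two backslashes, the value has one

print(len(s))             # 7: a, b, c, \, a, b, c
print(s.count('\\'))      # 1
print(s)                  # abc\abc       (the actual content)
print(repr(s))            # 'abc\\abc'    (what the interpreter echoes for s)
print(repr(repr(s)))      # "'abc\\\\abc'"
```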

For bytes, things are further complicated because print shows the representation: print needs a string, so a bytes object passed to it is converted with str(), which for bytes gives the same text as repr(). To see the actual content, decode the bytes first, i.e.:

>>> print(some_bytes.decode('utf-8'))  # or whatever the encoding is

In short: your code was doing what you wanted it to, it does not duplicate escape characters, you only thought it did because you were looking at the representation of the bytes, not the actual bytes content.
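One way to convince yourself of this is to look at the byte values directly instead of at the printed representation. The snippet below is only an illustration; the short payload is a made-up stand-in for the XML in the question:

```python
# Hypothetical short payload; the literal b'\\"' is two bytes: 92 and 34.
payload = b'<cmd id=\\"1\\">'

print(payload)                  # b'<cmd id=\\"1\\">'  (representation, backslashes doubled)
print(payload.decode('utf-8'))  # <cmd id=\"1\">       (actual content)
print(len(payload))             # 14
print(payload.count(b'\\'))     # 2: one backslash per quote
print(list(payload[8:10]))      # [92, 34]
```

Each quote is preceded by exactly one byte with value 92, which is what the API asks for.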

By the way, this also means that you don't have to be paranoid and go through the trouble of writing custom code to replace characters based on their ASCII values; you can simply call replace:

>>> example = bytes('<some attr="value">test</some>', encoding='utf-8')
>>> result = example.replace(b'"', b"\\\"")
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>

I won't pretend that b"\\\"" is intuitive, perhaps b'\\"' is better - but both require that you understand the difference between the representation of a string and its printed value.

So, finally:

>>> example = b'<some attr="value">test</some>'
>>> result = example.replace(b'"', b'\\"')
>>> print(result.decode('utf-8'))
<some attr=\"value\">test</some>
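To check that the byte-by-byte approach from the question and replace really produce the same bytes, the two can be compared directly. This is a rough sketch: the loop is a simplified version of the question's code (it appends to an empty bytearray rather than indexing into a preallocated one), and the XML declaration is left out for brevity:

```python
temp = b'<tx><cmd id="1">SetChLevel</cmd><name>C</name><value>30</value></tx>'

# Byte-by-byte approach, as in the question (34 is ", 92 is a backslash).
looped = bytearray()
for char in temp:
    if char == 34:
        looped.append(92)
    looped.append(char)

# One-call approach from this answer.
replaced = temp.replace(b'"', b'\\"')

print(bytes(looped) == replaced)   # True
print(len(replaced) - len(temp))   # 2: one extra byte per quote
```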

Upvotes: 3
