Saeko

Reputation: 441

How to decode escape sequence in string written in decimal properly

I have a piece of code that contains strings with UTF-8 escape sequences written in decimal, such as

my_string = "Hello\035"

which should be then interpreted as

Hello#

I don't mind parsing the decimal value myself. So far I've applied something like this to the entire string, and it seems to work best (no error, and it does something):

import codecs

print(codecs.escape_decode(my_string)[0].decode("utf-8"))

But the numbering seems quite off: I have to use the \043 escape sequence to get the hash sign (#) decoded properly, and the same is true of all the other characters.

Upvotes: 2

Views: 610

Answers (1)

Kevin

Reputation: 76254

You can't unambiguously detect and replace all \ooo escape sequences from a string literal, because those escape sequences are irretrievably replaced with their corresponding character values before your first line of code ever runs. As far as Python is concerned, "foo\041" and "foo!" are 100% identical, and there's no way to determine that the former object was defined with an escape sequence and the latter wasn't.
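As a quick illustration of the point above, the following sketch compares a few literals. The variable names are only for illustration. It also shows why the numbering looked off in the question: Python reads \ooo escapes in a regular literal as octal, so \043 (octal 43, decimal 35) is the hash sign, while \035 is the control character with code 29.

```python
# Escapes in a regular string literal are resolved before any code runs.
octal_literal = "foo\041"   # \041 is octal for decimal 33, i.e. "!"
plain_literal = "foo!"
same_object_value = octal_literal == plain_literal
print(same_object_value)    # True: the two strings are indistinguishable

# \035 is read as octal 35 (decimal 29), not as decimal 35.
code_of_035 = ord("\035")
print(code_of_035)          # 29

# \043 is octal 43 (decimal 35), which is the hash sign.
hash_from_octal = "\043"
print(hash_from_octal)      # #

# A raw string keeps the backslash and the digits as separate characters.
raw_length = len(r"Hello\035")
regular_length = len("Hello\035")
print(raw_length, regular_length)  # 9 6
```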

If you have some flexibility with regard to the form of the input data, then you might still be able to do what you want. For example, if you're allowed to use raw strings instead of regular strings, then the escape sequence in r"Hello\035" won't be replaced before your code runs (in a regular literal, \035 is read as octal and becomes the control character with code 29, not a hash sign). Instead, it will be interpreted as "Hello, followed by a backslash, followed by 0, 3 and 5". Since the digit characters are still accessible, you can manipulate them in your code. For example,

import re

def replace_decimal_escapes(s):
    return re.sub(
        #locate all backslashes followed by three digits
        r"\\(\d\d\d)",
        #fetch the digit group, interpret it as a decimal integer, then get the corresponding char
        lambda x: chr(int(x.group(1), 10)),
        s
    )

test_strings = [
    r"Hello\035",
    r"foo\041",
    r"The \040quick\041 brown fox jumps over the \035lazy dog"
]

for s in test_strings:
    result = replace_decimal_escapes(s)
    print("input:  ", s)
    print("output: ", result)

Result:

input:   Hello\035
output:  Hello#
input:   foo\041
output:  foo)
input:   The \040quick\041 brown fox jumps over the \035lazy dog
output:  The (quick) brown fox jumps over the #lazy dog

As a bonus, this approach also works if you get your input strings via input(), since backslashes typed in that prompt by the user aren't interpreted as escape sequences. If you do print(replace_decimal_escapes(input())) and the user types "Hello\035", then the output will be "Hello#" as desired.
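The same idea should carry over to any text that reaches the program at run time, such as lines read from a file. Here is a minimal sketch, assuming the data arrives one string per line; io.StringIO is only a stand-in for a real file or stdin, and the sample lines are made up for the demonstration.

```python
import io
import re

def replace_decimal_escapes(s):
    # Same technique as above: backslash plus exactly three digits,
    # with the digits interpreted as a decimal code point.
    return re.sub(r"\\(\d{3})", lambda m: chr(int(m.group(1), 10)), s)

# Stand-in for a file or stdin. The doubled backslashes here produce a single
# literal backslash in the data, which is what a file or input() would contain.
fake_input = io.StringIO("Hello\\035\nfoo\\041\n")

# Strip the trailing newline from each line, then decode its escapes.
decoded = [replace_decimal_escapes(line.rstrip("\n")) for line in fake_input]
print(decoded)  # ['Hello#', 'foo)']
```

Note that this pattern only matches exactly three digits, so a shorter sequence such as \35 would be left untouched.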

Upvotes: 2
