P. A. Monsaille

Reputation: 207

How to replace hex value in a string

While importing data from a flat file, I noticed some embedded hex-values in the string (<0x00>, <0x01>).

I want to replace them with specific characters, but am unable to do so. Removing them won't work either. What it looks like in the exported flat file: https://i.sstatic.net/qxiEl.png Another example: https://i.sstatic.net/NJR8G.png


This is what I've tried (mind that <0x01> represents a non-editable entity; it's not recognized here):

import io
with io.open('1.txt', 'r+', encoding="utf-8") as p:
    s=p.read()
# included in case it bears any significance
import re
import binascii

s = "Some string with hex: <0x01>"

s = s.encode('latin1').decode('utf-8')
# throws e.g.: >>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 114: invalid start byte

s = re.sub(r'<0x01>', r'.', s)
s = re.sub(r'\\0x01', r'.', s)
s = re.sub(r'\\\\0x01', r'.', s)
s = s.replace('\0x01', '.')
s = s.replace('<0x01>', '.')
s = s.replace('0x01', '.')

Or something along these lines, in the hope of getting a grip on it while iterating through the whole string:

for x in s:
    try:
        base64.encodebytes(x)
        base64.decodebytes(x)
        s.strip(binascii.unhexlify(x))
        s.decode('utf-8')
        s.encode('latin1').decode('utf-8')
    except:
        pass

Nothing seems to get the job done.

I'd expect the characters to be replaceable with the methods I've dug up, but they are not. What am I missing? NB: I have to preserve umlauts (äöüÄÖÜ).

-- edit:

Could I be introducing the hex values myself when exporting? If so, is there a way to avoid that?

with io.open('out.txt', 'w', encoding="utf-8") as temp:
    temp.write(s)

Upvotes: 3

Views: 12847

Answers (1)

lenz

Reputation: 5817

Judging from the images, these are actually control characters. Your editor displays them greyed out, showing the value of each byte in hex notation. Your data does not contain the characters "0x01"; it contains a single byte with the value 1, so unhexlify and friends won't help.
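
To check this for yourself, here is a minimal sketch. The sample string is made up for illustration (it is not taken from your file), and chr(1) is used to build the character with value 1:

```python
# Hypothetical sample: chr(1) builds the single character with code point 1.
sample = "sich z" + chr(1) + " B."

print(len(chr(1)))       # 1: the control character is one character long
print("0x01" in sample)  # False: the four-character text "0x01" is not in the data
print(repr(sample))      # 'sich z\x01 B.': repr() shows the character as an escape
```

The editor's <0x01> is only a way of displaying that one character, which is why searching for the literal text finds nothing.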

In Python, these characters can be produced in string literals with escape sequences using the notation \xHH, with two hexadecimal digits. The fragment from the first image is probably equal to the following string:

"sich z\x01 B. irgendeine"

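Your attempts to remove them were close: s = s.replace('\x01', '.') should work. Note that '\0x01' from your attempt is a different string: \0 is the escape for a NUL character, and it is followed by the literal characters x01.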
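
If several different control characters turn up (you mention <0x00> as well), a regular expression can replace them in one pass. This is only a sketch under some assumptions: the function name replace_control_chars and the sample text are made up for illustration, and I chose to leave tab, newline and carriage return alone, which may or may not suit your data. Umlauts are untouched because they lie outside the matched range.

```python
import re

# ASCII control characters, excluding tab (\x09), newline (\x0a) and
# carriage return (\x0d). Which characters to keep is an assumption.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def replace_control_chars(text, replacement="."):
    """Return text with each matched control character replaced."""
    return CONTROL_CHARS.sub(replacement, text)

# Hypothetical sample containing two control characters and some umlauts.
sample = "sich z\x01 B. irgendeine \x00 Prüfung für Äpfel"
cleaned = replace_control_chars(sample)
print(cleaned)  # sich z. B. irgendeine . Prüfung für Äpfel
```

Apply it to the string you get from p.read(), before writing the result back out with encoding="utf-8".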

Upvotes: 2
