Phil K.
Phil K.

Reputation: 53

Convert string containing "b'…'" to unicode

I try now for several hours to find a solution for this problem. I need to read in a generated CSV file that has the headers of the columns in a format like this:

"b'Device Name' (b'')"

or even

"b'Bezugsz\\xc3\\xa4hler' (b'Wh')"

I want to convert these strings to Unicode. However, until now I'm out of luck. All examples with encode or decode I found so far didn't lead in a useful direction. I need to get rid of the b'…' part as well as the \x escapes.

I hope someone here has some useful information. :)

edit: as requested the desired output:

"Device Name ()"
"Bezugszähler (Wh)"

the first case is easy to achieve with replace(). But I look for a solution for the second case, which would then naturally include the first case.

I tried solutions with ast.literal_eval() but this chokes on the parentheses. Solutions with .encode().decode() also did not work as expected.

Upvotes: 1

Views: 181

Answers (1)

wjandrea
wjandrea

Reputation: 33107

Here's a quick and dirty way to do this:

  1. Use regex to find the faux-bytes
  2. Use ast.literal_eval() to convert them to actual bytes
  3. Decode bytes to strings
  4. Insert back into a template
headers = [
    "b'Device Name' (b'')",
    "b'Bezugsz\\xc3\\xa4hler' (b'Wh')"]

# ---

import ast
import re

def f(string):
    faux_bytes = re.findall(r"b'.*?'", string)
    real_bytes = [ast.literal_eval(f) for f in faux_bytes]
    decoded = [s.decode() for s in real_bytes]
    return '{} ({})'.format(*decoded)

result = [f(h) for h in headers]
print(result)

Output:

['Device Name ()', 'Bezugszähler (Wh)']

Upvotes: 1

Related Questions