How to escape unicode special chars in string and write it to UTF encoded file

Question

What I aim to achieve is to:

string like:

Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.

convert to:

'Bitte \u00FCberpr\u00FCfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und l\u00F6schen Sie dann die tats\u00E4chlichen Dokumente.'

and write it in this form to file (which is UTF-8 encoded).

Jiř&#237; Baum · Accepted Answer

Another solution, not relying on the built-in repr() but rather implementing it from scratch:

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

enc = re.sub('[^ -~]', lambda m: '\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
Upper-case encoding, as specified (\u00FC rather than \u00fc)
Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
It does not take care of any pre-existing \u sequences, whereas repr() turns those into \u; could be extended, perhaps to encode \ as \u005C:
```
enc = re.sub(r'[^ -[\]-~]', lambda m: '\u%04X' % ord(m[0]), orig)
```

How to escape unicode special chars in string and write it to UTF encoded file

Answers (2)

Related Questions