reynoldsnlp

Reputation: 1210

More efficient way to make unicode escape codes

I am using Python to automatically generate qsf files for Qualtrics online surveys. The qsf file requires Unicode characters to be escaped using the \uXXXX convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I am achieving this with the following expression:

'слово'.encode('ascii','backslashreplace').decode('ascii')

The output is exactly what I need, but since this is a two-step process, I wondered if there is a more efficient way to get the same result.
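For reference, here is a minimal runnable version of the current two-step approach (variable names are illustrative):

```python
# Escape non-ASCII characters with the backslashreplace error handler,
# then decode the resulting ASCII bytes back to str.
s = 'слово'
escaped = s.encode('ascii', 'backslashreplace').decode('ascii')
print(escaped)  # \u0441\u043b\u043e\u0432\u043e
```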

Upvotes: 3

Views: 660

Answers (2)

Neapolitan

Reputation: 2163

If you open your output file in 'wb' mode, it accepts byte strings rather than str arguments:

s = 'слово'
with open('data.txt','wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed

This seems to do what you want:

$ cat data.txt
\u0441\u043b\u043e\u0432\u043e

and it avoids both the decode step and any translation that happens when writing str to a text stream.
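As a sanity check (a sketch, not part of the answer's original code): the unicode_escape codec round-trips, so the bytes written to the file can be decoded back to the original string:

```python
s = 'слово'
data = s.encode('unicode_escape')
print(data.decode('ascii'))  # \u0441\u043b\u043e\u0432\u043e

# Decoding with the same codec restores the original string.
assert data.decode('unicode_escape') == s
```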


Updated to use encode('unicode_escape') as per the suggestion of @J.F.Sebastian.

%timeit reports that it is quite a bit faster than encode('ascii', 'backslashreplace'):

In [18]: f = open('data.txt', 'wb')

In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop

In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop

In [21]: f.close()

Curiously, the slowest-to-fastest spread that %timeit reports for encode('unicode_escape') is much larger than for encode('ascii', 'backslashreplace'), even though its per-loop time is faster, so be sure to test both in your environment.
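Outside IPython, a quick comparison can be sketched with the standard-library timeit module (timings will vary with machine and Python version, so no expected numbers are shown):

```python
import timeit

s = 'слово'
n = 100_000

# Time each encoding approach over n calls and report µs per call.
t_escape = timeit.timeit(lambda: s.encode('unicode_escape'), number=n)
t_replace = timeit.timeit(lambda: s.encode('ascii', 'backslashreplace'), number=n)

print(f'unicode_escape:   {t_escape / n * 1e6:.2f} µs per call')
print(f'backslashreplace: {t_replace / n * 1e6:.2f} µs per call')
```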

Upvotes: 3

jfs

Reputation: 414585

I doubt that it is a performance bottleneck in your application but s.encode('unicode_escape') can be faster than s.encode('ascii', 'backslashreplace').

To avoid calling .encode() manually, you could pass the encoding to open():

with open(filename, 'w', encoding='unicode_escape') as file:
    print(s, file=file)

Note: it escapes non-printable ASCII characters too, e.g. a newline is written as \n, a tab as \t, etc.
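A quick illustration of that caveat, using the codec directly rather than open() (the sample string is just an example):

```python
# unicode_escape also escapes control characters, so a tab and a
# newline come out as literal backslash sequences in the output.
print('a\tб\n'.encode('unicode_escape'))  # b'a\\t\\u0431\\n'
```

This means that with print(s, file=file) the terminating newline itself lands in the file as the two characters \n, not as an actual line break.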

Upvotes: 2
