reynoldsnlp

Reputation: 1210

More efficient way to make unicode escape codes

I am using Python to automatically generate qsf files for Qualtrics online surveys. The qsf file requires Unicode characters to be escaped using the \uXXXX convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I am achieving this with the following expression:

'слово'.encode('ascii','backslashreplace').decode('ascii')

The output is exactly what I need, but since this is a two-step process, I wondered if there is a more efficient way to get the same result.
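For reference, here is a minimal runnable version of the current two-step approach (variable names are illustrative):

```python
# Escape non-ASCII characters with the backslashreplace error handler,
# then decode the resulting ASCII bytes back to str.
s = 'слово'
escaped = s.encode('ascii', 'backslashreplace').decode('ascii')
print(escaped)  # \u0441\u043b\u043e\u0432\u043e
```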

Upvotes: 3

Views: 660

Answers (2)

Neapolitan

Reputation: 2163

If you open your output file in 'wb' mode, it accepts byte strings rather than str arguments:

s = 'слово'
with open('data.txt','wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed

This seems to do what you want:

$ cat data.txt
\u0441\u043b\u043e\u0432\u043e

and it avoids both the decode step and any translation that happens when writing str to a text stream.
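As a sanity check (a sketch, not part of the answer's original code): the unicode_escape codec round-trips, so the bytes written to the file can be decoded back to the original string:

```python
s = 'слово'
data = s.encode('unicode_escape')
print(data.decode('ascii'))  # \u0441\u043b\u043e\u0432\u043e

# Decoding with the same codec restores the original string.
assert data.decode('unicode_escape') == s
```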


Updated to use encode('unicode_escape') as per the suggestion of @J.F.Sebastian.

%timeit reports that it is quite a bit faster than encode('ascii', 'backslashreplace'):

In [18]: f = open('data.txt', 'wb')

In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop

In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop

In [21]: f.close()

Curiously, the slowest-to-fastest spread that %timeit reports for encode('unicode_escape') is much larger than for encode('ascii', 'backslashreplace'), even though its per-loop time is faster, so be sure to test both in your environment.
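Outside IPython, a quick comparison can be sketched with the standard-library timeit module (timings will vary with machine and Python version, so no expected numbers are shown):

```python
import timeit

s = 'слово'
n = 100_000

# Time each encoding approach over n calls and report µs per call.
t_escape = timeit.timeit(lambda: s.encode('unicode_escape'), number=n)
t_replace = timeit.timeit(lambda: s.encode('ascii', 'backslashreplace'), number=n)

print(f'unicode_escape:   {t_escape / n * 1e6:.2f} µs per call')
print(f'backslashreplace: {t_replace / n * 1e6:.2f} µs per call')
```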

Upvotes: 3

jfs

Reputation: 414585

I doubt that it is a performance bottleneck in your application but s.encode('unicode_escape') can be faster than s.encode('ascii', 'backslashreplace').

To avoid calling .encode() manually, you could pass the encoding to open():

with open(filename, 'w', encoding='unicode_escape') as file:
    print(s, file=file)

Note: it escapes non-printable ASCII characters too, e.g. a newline is written as \n, a tab as \t, etc.
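A quick illustration of that caveat, using the codec directly rather than open() (the sample string is just an example):

```python
# unicode_escape also escapes control characters, so a tab and a
# newline come out as literal backslash sequences in the output.
print('a\tб\n'.encode('unicode_escape'))  # b'a\\t\\u0431\\n'
```

This means that with print(s, file=file) the terminating newline itself lands in the file as the two characters \n, not as an actual line break.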

Upvotes: 2
