Reputation: 1210
I am using Python to automatically generate qsf files for Qualtrics online surveys. The qsf file requires Unicode characters to be escaped using the \u+hex convention: 'слово' = '\u0441\u043b\u043e\u0432\u043e'. Currently, I am achieving this with the following expression:
'слово'.encode('ascii','backslashreplace').decode('ascii')
The output is exactly what I need, but since this is a two-step process, I wondered if there is a more efficient way to get the same result.
Upvotes: 3
Views: 660
Reputation: 2163
If you open your output file in binary mode ('wb'), it accepts bytes rather than str arguments:
s = 'слово'
with open('data.txt', 'wb') as f:
    f.write(s.encode('unicode_escape'))
    f.write(b'\n')  # add a line feed
This seems to do what you want:
$ cat data.txt
\u0441\u043b\u043e\u0432\u043e
and it avoids both the decode step and any translation that happens when writing a str to a text stream.
Updated to use encode('unicode_escape') as per the suggestion of @J.F.Sebastian.
%timeit reports that it is quite a bit faster than encode('ascii', 'backslashreplace'):
In [18]: f = open('data.txt', 'wb')
In [19]: %timeit f.write(s.encode('unicode_escape'))
The slowest run took 224.43 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.55 µs per loop
In [20]: %timeit f.write(s.encode('ascii','backslashreplace'))
The slowest run took 9.13 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.37 µs per loop
In [21]: f.close()
Curiously, the spread between the slowest and fastest runs reported by %timeit is much larger for encode('unicode_escape') (about 224x) than for encode('ascii', 'backslashreplace') (about 9x), even though its per-loop time is lower, so be sure to test both in your environment.
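To repeat the comparison outside IPython, and without the file write inside the timed statement, a minimal sketch using the standard timeit module might look like the following. The helper name time_encoders and the loop counts are illustrative choices, not part of the measurements above, and absolute numbers will differ by machine and Python version.

```python
import timeit

s = 'слово'

def time_encoders(text, number=100_000, repeat=5):
    """Return the best per-call time in seconds for each escaping approach."""
    candidates = {
        'unicode_escape': lambda: text.encode('unicode_escape'),
        'backslashreplace': lambda: text.encode('ascii', 'backslashreplace'),
    }
    results = {}
    for name, func in candidates.items():
        # Take the minimum over the repeats; it is the least noisy estimate.
        best = min(timeit.repeat(func, number=number, repeat=repeat))
        results[name] = best / number
    return results

if __name__ == '__main__':
    for name, seconds in time_encoders(s).items():
        print(f'{name}: {seconds * 1e6:.2f} µs per call')
```

This times only the encode call, so any difference it shows is not influenced by buffering in f.write.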
Upvotes: 3
Reputation: 414585
I doubt that it is a performance bottleneck in your application, but s.encode('unicode_escape') can be faster than s.encode('ascii', 'backslashreplace').
To avoid calling .encode() manually, you could pass the encoding to open():
with open(filename, 'w', encoding='unicode_escape') as file:
    print(s, file=file)
Note: it translates non-printable ASCII characters too, e.g., a newline is written as \n, a tab as \t, etc.
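The difference only shows up when the text contains control characters or backslashes. A small sketch comparing the two approaches from this thread, using a made-up sample string:

```python
# Sample text with a tab and a trailing literal backslash (made up for illustration).
s = 'сло\tво\\'

via_backslashreplace = s.encode('ascii', 'backslashreplace').decode('ascii')
via_unicode_escape = s.encode('unicode_escape').decode('ascii')

# backslashreplace leaves ASCII untouched: the real tab is kept
# and the existing backslash is not doubled.
print(repr(via_backslashreplace))

# unicode_escape rewrites the tab as the two characters \t
# and doubles the existing backslash.
print(repr(via_unicode_escape))
```

If the input can contain such characters, the two approaches are not interchangeable, so it is worth checking which behaviour the target file format expects.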
Upvotes: 2