Maddy

Reputation: 1389

Error with UTF-16 encoded data in Python

Here's a snippet of code where a string is to be UTF-16 encoded and sent on the wire:

# -*- coding: utf8-*-

import unit_test_utils
import os
import sys

...
...
def run():
    test_dir = unit_test_utils.get_test_dir("test")

    try:
        file_name = u'débárquér.txt'
        open_req = createrequest.CreateRequest(factory)
        open_req.create_disp_ = defines.FILE_OPEN_IF
        open_req.file_name_ = '%s\\%s' % (test_dir, file_name)
        res = unit_test_utils.test_send(client, open_req)
        ....
        ....
    finally:
        client.close()

if __name__ == '__main__':
    run()

When this is run, the error is as follows:

# python /root/python/tests/unicode_test.py
Traceback (most recent call last):
  File "/root/python/tests/unicode_test.py", line 47, in <module>
    run()
  File "/root/python/tests/unicode_test.py", line 29, in run
    res = unit_test_utils.test_send(client, open_req)
  File "/root/python/unit_test_utils.py", line 336, in test_send
    handle_class=handle_class)
  File "/root/python/unit_test_utils.py", line 321, in test_async_send
    test_handle_class(handle_class, expected_status))
  File "/root/usr/lib/python2.7/site-packages/client.py", line 220, in async_send
    return self._async_send(msg, function, handle_class, pdu_splits)
  File "/root/usr/lib/python2.7/site-packages/client.py", line 239, in _async_send
    data, handle = self._handle_request(msg, function, handle_class)
  File "/root/usr/lib/python2.7/site-packages/client.py", line 461, in _handle_request
    return handler(self, msg, *args, **kwargs)
  File "/root/usr/lib/python2.7/site-packages/client.py", line 473, in _common_request
    msg.encode(buf, smb_ver=2)
  File "/root/usr/lib/python2.7/site-packages/message.py", line 17, in encode
    new_offset = composite.Composite.encode(self, buf, offset, **kwargs)
  File "/root/usr/lib/python2.7/site-packages/pycifs/composite.py", line 36, in encode
    new_offset = self._encode(buf, offset, **kwargs)
  File "/root/usr/lib/python2.7/site-packages/packets/createrequest.py", line 128, in _encode
    offset = self._file_name.encode(self._file_name_value(**kwargs), buf, offset, **kwargs)
  File "/root/usr/lib/python2.7/site-packages/fields/unicode.py", line 76, in encode
    buf.append(_UTF16_ENC(value)[0])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 8: ordinal not in range(128)

What is wrong with the code?

When I tried this exercise locally, things seemed fine:

$ python
Python 2.6.6 (r266:84292, Jul 22 2015, 16:47:47)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> file_name = 'débárquér.txt'
>>> print type(file_name)
<type 'str'>
>>> utf16_filename = file_name.decode('utf8').encode('UTF-16LE')
>>> print type(utf16_filename)
<type 'str'>
>>> utf16_filename.decode('UTF-16LE')
u'd\xe9b\xe1rqu\xe9r.txt'

Upvotes: 0

Views: 2361

Answers (3)

Yaron

Reputation: 10450

Try replacing:

utf16_filename = file_name.decode('utf8').encode('UTF-16LE')

with

utf16_filename = unicode(file_name.decode('utf8')).encode('UTF-16LE')

Upvotes: -1

Mark Tolonen

Reputation: 177971

When working with Unicode text, convert incoming byte strings to Unicode as soon as you can, work with Unicode text in the script, then convert back to byte strings as late as you can.
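As a rough illustration of that decode-early, encode-late pattern, here is a minimal sketch. The helper name build_wire_name, the directory b'tests', and its 'ascii' encoding are hypothetical, chosen only for the example; UTF-16LE as the wire format is taken from the question.

```python
def build_wire_name(raw_dir, dir_encoding, file_name):
    """Decode early, join as text, encode late (hypothetical helper)."""
    # 1. Decode incoming bytes to text at the boundary.
    text_dir = raw_dir.decode(dir_encoding)
    # 2. Work only with text inside the program.
    full_name = u'%s\\%s' % (text_dir, file_name)
    # 3. Encode once, just before the data leaves the program.
    return full_name.encode('utf-16-le')


# Escapes are used instead of accented literals so the snippet does not
# depend on the source file's declared encoding.
wire = build_wire_name(b'tests', 'ascii', u'd\xe9b\xe1rqu\xe9r.txt')
round_trip = wire.decode('utf-16-le')
```

Decoding the result with the same codec gives back the original text, which is a quick way to check that only one encoding ended up in the buffer.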

You've got a mix of byte strings in different encodings, and the likely cause of the trouble is this line:

open_req.file_name_ = '%s\\%s' % (test_dir, utf16_filename)

It is unclear what encoding test_dir is in, but the format string is an ASCII byte string, and utf16_filename is a UTF-16LE-encoded byte string. The result will be a mix of encodings.
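To make the "mix of encodings" concrete, here is a small sketch. It assumes, purely for illustration, that test_dir is the ASCII byte string b'tests' and uses a shortened file name; neither value comes from the question.

```python
test_dir = b'tests'                                  # hypothetical ASCII bytes
utf16_filename = u'd\xe9b.txt'.encode('utf-16-le')   # UTF-16LE bytes

# Gluing the two byte strings together keeps each part in its own encoding:
# one byte per character for the directory, two bytes per character after it.
mixed = test_dir + b'\\' + utf16_filename

# Joining as text first and encoding once yields consistent UTF-16LE.
expected = u'tests\\d\xe9b.txt'.encode('utf-16-le')
```

The two results differ in both length and content, so a receiver that expects UTF-16LE for the whole field would misread the mixed version.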

Instead, determine what encoding test_dir uses, decode it to Unicode (if it is not Unicode already), and use Unicode strings everywhere. Here's an example:

test_dir = unit_test_utils.get_test_dir("test")
# If test_dir is not already Unicode, decode it. "encoding" is a placeholder:
# you need to know which encoding get_test_dir() actually returns.
test_dir = test_dir.decode(encoding)
file_name = u'débárquér.txt' # Unicode string!
open_req = createrequest.CreateRequest(factory)
open_req.create_disp_ = defines.FILE_OPEN_IF
# This would work...
# fullname = u'%s\\%s' % (test_dir, file_name)
# But a better way to join is this...
fullname = os.path.join(test_dir, file_name)
# I assume UTF-16LE is required for "file_name_" at this point.
open_req.file_name_ = fullname.encode('utf-16le')
res = unit_test_utils.test_send(client, open_req)
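One caveat worth checking: os.path.join uses the separator of the platform the script runs on, so on Linux it inserts '/', whereas the question's format string used a backslash. If the server does expect backslashes (an assumption, since the protocol details are not shown), the standard-library ntpath module joins Windows-style on any platform. A minimal sketch with a hypothetical directory name:

```python
import ntpath
import posixpath

test_dir = u'tests'                     # hypothetical, already-decoded text
file_name = u'd\xe9b\xe1rqu\xe9r.txt'

# ntpath always joins with a backslash, regardless of the host OS.
windows_style = ntpath.join(test_dir, file_name)

# posixpath is what os.path resolves to on Linux; it joins with '/'.
posix_style = posixpath.join(test_dir, file_name)

wire_name = windows_style.encode('utf-16-le')
```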

Upvotes: 2

roeland

Reputation: 5751

Do not assign text to byte strings. In Python 2 that means you have to use unicode literals:

file_name = u'débárquér.txt'  # <-- unicode literal
utf16_filename = file_name.encode('UTF-16LE')

Then make sure you accurately declare the encoding of your source file.
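For context on the error message itself: when Python 2 is asked to encode a byte string, it first decodes it implicitly with the ASCII codec, and that step is what fails on a byte such as 0xe9. The sketch below spells out that decode explicitly so it behaves the same on Python 3. Storing the name as Latin-1 is an assumption made only to produce a 0xe9 byte like the one in the traceback.

```python
# Bytes as they would look if the name were stored as Latin-1 (assumed here
# only to reproduce the 0xe9 byte from the traceback).
raw_name = u'd\xe9b\xe1rqu\xe9r.txt'.encode('latin-1')

try:
    # Python 2 performs this ASCII decode implicitly when a byte string is
    # handed to an encoder; here it is written out explicitly.
    raw_name.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Starting from text avoids the implicit decode altogether.
utf16_name = u'd\xe9b\xe1rqu\xe9r.txt'.encode('utf-16-le')
```

Here message holds the same kind of "'ascii' codec can't decode byte 0xe9" text as the traceback, while the Unicode value encodes cleanly.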

Upvotes: 2
