Reputation: 2586

shlex.split still not supporting unicode?

According to the documentation, shlex should support Unicode as of Python 2.7.3. However, when I run the code below, I get: UnicodeEncodeError: 'ascii' codec can't encode characters in position 44-49: ordinal not in range(128)

Am I doing something wrong?

import shlex

command_full = u'software.py -fileA="sequence.fasta" -fileB="新建文本文档.fasta.txt" -output_dir="..." -FORMtitle="tst"'

shlex.split(command_full)

The exact error is as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shlex.py", line 275, in split
    lex = shlex(s, posix=posix)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shlex.py", line 25, in __init__
    instream = StringIO(instream)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 44-49: ordinal not in range(128)

This is the output on my Mac, using Python from MacPorts. I get exactly the same error on an Ubuntu machine with the "native" Python 2.7.3.

Upvotes: 12

Answers (3)

tinyhare

Reputation: 2401

I use Python 2.7.16 and find that:

shlex works with ordinary byte strings ('xxxx')

ushlex works with unicode strings (u'xxx')

# -*- coding: utf-8 -*-
import shlex
import ushlex  # third-party package from PyPI

# A plain (byte) string works with the standard shlex module.
command_full1 = 'software.py -fileA="sequence.fasta" -fileB="新建文本文档.fasta.txt" -output_dir="..." -FORMtitle="tst"'
print shlex.split(command_full1)

# A unicode string is split with ushlex instead.
command_full2 = u'software.py -fileA="sequence.fasta" -fileB="新建文本文档.fasta.txt" -output_dir="..." -FORMtitle="tst"'
print ushlex.split(command_full2)

Output:

['software.py', '-fileA=sequence.fasta', '-fileB=\xe6\x96\xb0\xe5\xbb\xba\xe6\x96\x87\xe6\x9c\xac\xe6\x96\x87\xe6\xa1\xa3.fasta.txt', '-output_dir=...', '-FORMtitle=tst']
[u'software.py', u'-fileA=sequence.fasta', u'-fileB=\u65b0\u5efa\u6587\u672c\u6587\u6863.fasta.txt', u'-output_dir=...', u'-FORMtitle=tst']

Upvotes: 0

Gringo Suave

Reputation: 31910

Actually, there's been a patch for over five years. Last year I got tired of copying ushlex into every project, so I put it on PyPI:

https://pypi.python.org/pypi/ushlex/

Upvotes: 3

Martijn Pieters

Reputation: 1123460

The shlex.split() code wraps both unicode() and str() instances in a StringIO() object. As the traceback in the question shows, that step implicitly encodes a unicode() input with the 'ascii' codec, so it cannot handle the full Unicode code point range.

You'll have to encode the string yourself (UTF-8 should work) if you still want to use shlex.split(). The documented support for unicode() objects holds only for objects that survive this implicit conversion, not for arbitrary code points.
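To see where the positions in the traceback come from, here is a small sketch that reproduces the same failure with an explicit encode. It assumes the implicit conversion behaves like calling .encode('ascii') on the string, which is what the error message suggests; the variable name error is only for illustration.

```python
# -*- coding: utf-8 -*-
command_full = u'software.py -fileA="sequence.fasta" -fileB="新建文本文档.fasta.txt" -output_dir="..." -FORMtitle="tst"'

# Assumption: an explicit ASCII encode fails the same way as the implicit one.
try:
    command_full.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

if error is not None:
    # exc.end is exclusive, so the message reports end - 1 as the last position.
    print('%d-%d' % (error.start, error.end - 1))
    # The offending slice is the run of non-ASCII characters in the file name.
    print(repr(command_full[error.start:error.end]))
```

The reported range covers exactly the six Chinese characters in the -fileB value, which is why the rest of the command is not the problem.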

Encoding, splitting, decoding gives me:

>>> [s.decode('utf8') for s in shlex.split(command_full.encode('utf8'))]
[u'software.py', u'-fileA=sequence.fasta', u'-fileB=\u65b0\u5efa\u6587\u672c\u6587\u6863.fasta.txt', u'-output_dir=...', u'-FORMtitle=tst']

A now-closed Python issue tried to address this, but the module is very byte-stream oriented, and no new patch has materialized. For now, encoding to UTF-8 (or to iso-8859-1, if all of your text fits in it) is the best I can come up with for you.
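If you need the encode, split, decode round trip in several places, it can be wrapped in a small helper. This is only a sketch: split_unicode is a hypothetical name, not part of shlex, and the Python 3 branch assumes that shlex there accepts text directly.

```python
# -*- coding: utf-8 -*-
import shlex
import sys


def split_unicode(command, encoding='utf-8'):
    """Split a text command line with shlex and return text tokens.

    Hypothetical helper, not part of the standard library.
    """
    if sys.version_info[0] >= 3:
        # Python 3: str is already text, so no round trip is needed.
        return shlex.split(command)
    # Python 2: encode to bytes, split, then decode each token again.
    return [token.decode(encoding)
            for token in shlex.split(command.encode(encoding))]


command_full = u'software.py -fileA="sequence.fasta" -fileB="新建文本文档.fasta.txt" -output_dir="..." -FORMtitle="tst"'
tokens = split_unicode(command_full)
for token in tokens:
    print(repr(token))
```

UTF-8 is the default here because it can represent every code point; a narrower encoding such as iso-8859-1 would raise on the file name in this example.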

Upvotes: 12
