user610650

How can strings with non-ASCII characters be retrieved with OptParse?

I'm using the optparse module to retrieve a string value. It returns option values as str byte strings, not as unicode objects.

So let's say I start my script with:

./someScript --some-option ééééé

French characters such as 'é' arrive as str byte strings and trigger a UnicodeDecodeError when the code reads them:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 99: ordinal not in range(128)

I played around a bit with the unicode built-in function, but either I get an error, or the character disappears:

>>> unicode('é');
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> unicode('é', errors='ignore');
u''

Is there anything I can do to use OptParse to retrieve unicode/utf-8 strings?

The string itself can apparently be retrieved and printed without trouble. The error occurs later, when I pass that string to SQLite (using the APSW module): cursor.execute("...") tries to convert it to unicode.

Here is a sample program that causes the error:

#!/usr/bin/python
# coding: utf-8

import os, sys, optparse
parser = optparse.OptionParser()
parser.add_option("--some-option")
(opts, args) = parser.parse_args()
print unicode(opts.some_option)

Upvotes: 6

Views: 2118

Answers (4)

lionyue

Reputation: 1

#!/usr/bin/python
# coding: utf-8

import os, sys, optparse

reload(sys)
sys.setdefaultencoding('utf-8')

parser = optparse.OptionParser()
parser.add_option(u"--some-option")
(opts, args) = parser.parse_args()
# opts has no print_help() method; print the parsed option instead
print unicode(opts.some_option)

Upvotes: 0

Mark Tolonen

Reputation: 178115

Input is returned in the console encoding, so based on your updated example, use:

print opts.some_option.decode(sys.stdin.encoding)

unicode(opts.some_option) fails because, when no encoding is given, it decodes using ascii.
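A minimal sketch of that decoding step, wrapped in a helper. The name to_text and the UTF-8 fallback are assumptions added for illustration: sys.stdin.encoding can be None when input is piped, so some default is needed. The isinstance check lets the same code pass through values that are already text.

```python
import sys

def to_text(value, encoding=None):
    """Decode a byte string with the console encoding; return text unchanged."""
    # Hypothetical fallback: use UTF-8 when the console encoding is unknown.
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

# Example: the two bytes a UTF-8 terminal sends for 'é'
e_acute = to_text(b"\xc3\xa9", "utf-8")
```

With this in place, the option value could be converted with to_text(opts.some_option) before it is handed to the database layer.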

Upvotes: 1

jro

Reputation: 9484

You could decode the arguments before the parser handles them. Taking your example:

#!/usr/bin/python
# coding: utf-8
import os, sys, optparse
parser = optparse.OptionParser()
parser.add_option("--some-option")

# Decode the command line arguments to unicode
for i, a in enumerate(sys.argv):
    sys.argv[i] = a.decode('ISO-8859-15')

(opts, args) = parser.parse_args()
print type(opts.some_option), opts.some_option

This gives the following output:

C:\workspace>python file.py --some-option préférer
<type 'unicode'> préférer

I've chosen the ISO/IEC 8859-15 code page, as it seems the most appropriate for you. Adapt if needed.
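The choice of encoding matters: the 0xc3 byte in the question's traceback suggests the terminal there sends UTF-8, not ISO-8859-15. The sketch below is an illustration, not part of the original answer. It wraps the same approach in a hypothetical parse_decoded function and feeds it hand-written UTF-8 bytes so the two encodings can be compared.

```python
import optparse

def parse_decoded(argv, encoding="utf-8"):
    """Decode byte-string arguments, then let optparse handle them."""
    # Arguments that are already text are passed through unchanged.
    decoded = [a.decode(encoding) if isinstance(a, bytes) else a for a in argv]
    parser = optparse.OptionParser()
    parser.add_option("--some-option")
    opts, args = parser.parse_args(decoded)
    return opts.some_option

# 'préférer' as a UTF-8 terminal would deliver it
raw = [b"--some-option", b"pr\xc3\xa9f\xc3\xa9rer"]

good = parse_decoded(raw, "utf-8")        # each 'é' is one character
bad = parse_decoded(raw, "ISO-8859-15")   # each 'é' becomes two characters
```

Decoding with the wrong code page raises no error here; it silently produces a longer, garbled string, so it is worth checking what the terminal actually sends.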

Upvotes: 4

Woot4Moo

Reputation: 24336

I believe your error is related to the following:

For example, to write Unicode literals including the Euro currency symbol, the ISO-8859-15 encoding can be used, with the Euro symbol having the ordinal value 164. This script will print the value 8364 (the Unicode codepoint corresponding to the Euro symbol) and then exit:

# -*- coding: iso-8859-15 -*-

currency = u"€"
print ord(currency)
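The same mapping can be checked without depending on how the source file is saved, by decoding the raw byte explicitly. This is only a sketch of the quoted example; the variable names are made up for illustration.

```python
# Byte 164 (0xA4) is the Euro sign in ISO-8859-15 ...
euro_byte = b"\xa4"
currency = euro_byte.decode("iso-8859-15")
euro_codepoint = ord(currency)

# ... but the generic currency sign in ISO-8859-1
latin1_codepoint = ord(euro_byte.decode("iso-8859-1"))
```

The same byte yields a different code point under each code page, which is why the encoding has to be stated rather than guessed.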

Upvotes: 0
