Reputation: 145
I'm a newbie, and I'm sure a similar question has been asked in the past, but I am having trouble finding/understanding an answer. Thank you in advance for being patient with me!
So I'm trying to write a script to read lines in a utf-8 encoded input file, compare portions of it to an optional command line argument passed in by the user, and if there's a match, to do some stuff to that line before printing it to an output file. I'm using codecs to open the files.
I'm using the argparse module to parse command line arguments right now. The lines in the file can be in all sorts of languages, hence the command line argument needs to also be utf-8.
For example:
A line from the file might look like this:
разъедают {. r ax z . j je . d ax1 . ju t .}
The script should be called from the command line with something like this:
>python myscript.py mytextfile.txt -grapheme ъ
Here's the part of my code that is supposed to do the processing. In this case, orth is some Cyrillic text and grapheme is a Cyrillic character.
def process_orth(orth, grapheme):
    grapheme = grapheme.decode(sys.stdin.encoding).encode('utf-8')
    if (grapheme in orth):
        print 'success, your grapheme was: ' + grapheme.encode('utf-8')
        return True
    else:
        print 'failure, your grapheme was: ' + grapheme.encode('utf-8')
        return False
Unfortunately, even though the grapheme is definitely there, the function returns false and prints a question mark instead of the grapheme:
failure, your grapheme was: ?
I've tried adding the following at the start of process_orth() as per the recommendation of some other post I read, but it didn't seem to work:
grapheme.decode(sys.stdin.encoding).encode('utf-8')
So my question is...
How do I pass utf-8 strings through the command line into a python script? Also, are there any extra quirks with this on Windows7 (and does having cygwin installed change anything)?
Upvotes: 0
Views: 1028
Reputation: 1121972
If you are opening the input file using codecs.open() then you have unicode data, not encoded data. You would want to just decode grapheme, not encode it again to UTF-8:
grapheme = grapheme.decode(sys.stdin.encoding)
if grapheme in orth:
    print u'success, your grapheme was: ' + grapheme
    return True
Note that we print unicode as well; normally print will ensure that Unicode values are encoded again for your current codepage. This can still fail, as Windows console printing is notoriously difficult; see http://wiki.python.org/moin/PrintFails.
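To make the point above concrete, here is a minimal sketch of comparing text with text. The helper names to_text and matching_lines are hypothetical, and the default encoding of 'utf-8' is an assumption; on a real console you would pass whatever encoding the argument actually arrived in (the code above uses sys.stdin.encoding). The sketch is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs


def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding it only if it is a byte string.

    Hypothetical helper; the 'utf-8' default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def process_orth(orth, grapheme, encoding='utf-8'):
    """Report whether grapheme occurs in orth, comparing Unicode with Unicode."""
    return to_text(grapheme, encoding) in to_text(orth, encoding)


def matching_lines(path, grapheme, encoding='utf-8'):
    """Collect the lines of a UTF-8 file that contain grapheme.

    codecs.open() with an encoding yields Unicode lines, so only the
    grapheme may still need decoding.
    """
    with codecs.open(path, 'r', encoding='utf-8') as handle:
        return [line.rstrip(u'\n') for line in handle
                if process_orth(line, grapheme, encoding)]
```

The key design choice is that nothing is re-encoded before the comparison: decoding happens once at the boundary, and everything after that is Unicode.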
Unfortunately, sys.argv on Windows can apparently end up garbled, as Python uses a non-unicode aware system call. See "Read Unicode characters from command-line arguments in Python 2.x on Windows" for a unicode-aware alternative.
I see no reason for argparse to have any problems with Unicode input, but if it does, you can always take the unicode output from win32_unicode_argv() and encode it to UTF-8 before passing it to argparse.
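If you go the UTF-8 route, one way to get Unicode back out of argparse is to decode inside the parser through its type= callable. This is only a sketch under assumptions: to_text, build_parser and parse_args are hypothetical names, the argument list is assumed to hold UTF-8 byte strings on Python 2 (for example, the encoded output of win32_unicode_argv()), and on Python 3 the values are already text, so the helper passes them through unchanged.

```python
# -*- coding: utf-8 -*-
import argparse


def to_text(value, encoding='utf-8'):
    """Decode a byte string to Unicode; leave text untouched (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def build_parser():
    """Build a parser that mirrors the command line shown in the question."""
    parser = argparse.ArgumentParser(description='Filter lines by grapheme.')
    parser.add_argument('infile', type=to_text)
    # argparse calls to_text on the raw argument, so args.grapheme is Unicode.
    parser.add_argument('-grapheme', type=to_text, default=None)
    return parser


def parse_args(argv):
    """Parse an explicit argument list (without the script name)."""
    return build_parser().parse_args(argv)
```

For example, parse_args(['mytextfile.txt', '-grapheme', u'ъ']).grapheme can then be compared directly against lines read with codecs.open().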
Upvotes: 3