frank

Reputation: 1303

Python unwanted UnicodeDecodeError exception from one entry in list comprehension

I am using Python 2.6 on Linux. I have a shift_jis (Japanese) encoded .csv file that I am loading. I am reading the header in, and doing a regex replacement to translate a few values, then writing the file back as shift_jis. I am hitting a UnicodeDecodeError on one of the characters in the file, ①, which should be a valid character according to http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml. The other Japanese characters decode fine.

1) I am decoding each string with shift_jis in a list comprehension. As a workaround, how can I skip this and any other characters that fail to decode? Here is the code, with the CSV values already read into list_of_row_values.

#! /usr/bin/python
# -*- coding: utf8 -*-

import csv
import re

with open('test.csv', 'wb') as output_file:
    wr = csv.writer(output_file, delimiter=',', quoting=csv.QUOTE_NONE) 

    # the following corresponds to reading "日付,直流電流計測①,直流電流計測②" from a shift_jis encoded csv file
    # 直流電流計測① is throwing an exception when decoded but it is a valid character according to
    # http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml                           
    list_of_row_values = ['\x93\xfa\x95t', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87@', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87A']            

    # take away the last character in entry two, and three, and it would work 
    # but that means I know all the bad characters before hand
    #list_of_row_values = ['\x93\xfa\x95t', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa', '\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa']

    try:
        list_of_unicode_row_values = [str.decode('shift_jis') for str in list_of_row_values]                    
    except UnicodeDecodeError:
        # Question: what if I want to just ignore the character that cannot be decoded and still get the list
        # of "日付,直流電流計測,直流電流計測" as unicode?
        # right now, list_of_unicode_row_values would remain undefined, and the next line will
        # have a NameError
        print 'UnicodeDecodeError'
        pass

    # do a regex replacement to translate one column heading value
    list_of_translated_unicode_row_values = \
    [re.sub('日付'.decode('utf-8'), 'Date Time', str) for str in list_of_unicode_row_values]          

    list_of_translated_row_values = [unicode_str.encode('shift_jis') for unicode_str in list_of_translated_unicode_row_values]
    wr.writerow(list_of_translated_row_values)

2) On a side note, how should I report this Python bug that a particular shift_jis character seems to fail to be properly decoded?

Upvotes: 1

Views: 215

Answers (1)

nneonneo

Reputation: 179592

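In general, you can pass 'ignore' as the error handler to skip over invalid characters. On Python 2.6 it has to be passed positionally, because str.decode only accepts keyword arguments (errors='ignore') from Python 2.7 onward:

list_of_unicode_row_values = [str.decode('shift_jis', 'ignore') for str in list_of_row_values]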

This results in the following entries in list_of_unicode_row_values:

日付
直流電流計測
直流電流計測
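If you would rather see which cells fail before discarding anything, a sketch along these lines may help. The helper name find_undecodable is hypothetical, not a library function; it relies only on the start attribute that UnicodeDecodeError carries, and it passes 'ignore' positionally so it should also run on Python 2.6.

```python
# Sketch: report which cells fail strict decoding, then decode leniently.
# find_undecodable is a hypothetical helper name, not part of any library.
list_of_row_values = [
    b'\x93\xfa\x95t',
    b'\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87@',
    b'\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87A',
]


def find_undecodable(values, encoding):
    """Return (index, byte_offset) for each value that fails strict decoding."""
    failures = []
    for index, raw in enumerate(values):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first offending byte
            failures.append((index, exc.start))
    return failures


failures = find_undecodable(list_of_row_values, 'shift_jis')

# Error handler passed positionally so this also works on Python 2.6.
lenient = [raw.decode('shift_jis', 'ignore') for raw in list_of_row_values]
```

Here failures points at the second and third cells, each at the byte where the circled digit begins. How many bytes 'ignore' drops for an invalid two-byte sequence can differ between Python versions, so it is worth checking the lenient output rather than assuming it.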

However, in your particular case, you are using the wrong encoding. Python's shift_jis encoding conforms to the JIS X 0208 standard, while the character ① exists in the newer JIS X 0213 standard. To use the latter, just use the shift_jisx0213 encoding. Use it when encoding the output as well, because ① cannot be encoded with shift_jis either:

list_of_unicode_row_values = [str.decode('shift_jisx0213') for str in list_of_row_values]

You will get the following entries:

日付
直流電流計測①
直流電流計測②

as expected.
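Since the question also writes the row back out, here is a rough sketch of the full round trip. It assumes the output file should use shift_jisx0213 as well; the ENCODING constant and the variable names are illustrative, not taken from the original script.

```python
import re

# Assumption: the same codec is used for reading and for writing.
ENCODING = 'shift_jisx0213'

list_of_row_values = [
    b'\x93\xfa\x95t',
    b'\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87@',
    b'\x92\xbc\x97\xac\x93d\x97\xac\x8cv\x91\xaa\x87A',
]

# Decode every cell; the circled digits decode under this codec.
decoded = [raw.decode(ENCODING) for raw in list_of_row_values]

# Translate the two-character date heading (U+65E5 U+4ED8) to English.
translated = [re.sub(u'\u65e5\u4ed8', u'Date Time', text) for text in decoded]

# Encode with the same codec so the circled digits survive the trip back.
encoded = [text.encode(ENCODING) for text in translated]

# Plain shift_jis cannot represent the circled digit one (U+2460).
try:
    u'\u2460'.encode('shift_jis')
    plain_shift_jis_ok = True
except UnicodeEncodeError:
    plain_shift_jis_ok = False
```

The encoded list can then be passed to wr.writerow() as in the question. Keeping the codec name in one place makes it harder to decode with one encoding and accidentally encode with another.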

Upvotes: 3
