Matt D

Reputation: 131

Python3: dealing with UTF8 incompatible characters in CSV output

I'm on Python 3.2 and have SQL output that I'm writing to CSV files, each row containing a 'Name' identifier and a 'Specifics' field. For some of the data from China, people's names (and thus Chinese characters) are being inserted. I've done my best to read through the Unicode/decoding docs, but I'm at a loss as to how to reform or remove these characters in a general way, in-line within my Python.

I'm running through the file like so:

import csv, os, os.path

rfile = open('nonbillabletest2.csv', 'r', newline='')
dataread = csv.reader(rfile)
trash = next(rfile)  # skip the header line in the CSV

# Process the target CSV by creating an output with a unique filename per CompanyName
for line in dataread:
    [CompanyName, Specifics] = line
    # Check that a target csv does not already exist
    if not os.path.exists('test leads ' + CompanyName + '.csv'):
        wfile = open('test leads ' + CompanyName + '.csv', 'a')
        datawrite = csv.writer(wfile, lineterminator='\n')
        datawrite.writerow(['CompanyName', 'Specifics'])  # write a header row in each new file
        datawrite.writerow([CompanyName, Specifics])
wfile.close()
rfile.close()

I receive this error:

Traceback (most recent call last):
  File "C:\Users\Matt\Dropbox\nonbillable\nonbillabletest.py", line 26, in <module>
    for line in dataread:
  File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1886: character maps to <undefined>

Examining the file contents, there are clearly some non-UTF-8 bytes in there:

print(repr(open('nonbillabletest2.csv', 'rb').read()))

b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\neGENTIC,\x86\xac\xff; \r\n'

Incorporating an encoding='utf8' argument does not resolve the issue. I have been able to remove individual characters with ...replace('\x86\xac\xff', ''), but I'd have to do this for every sequence I could encounter, which isn't efficient.
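
For reference, the one-at-a-time removal I've been doing is essentially this, working on the raw bytes (the sequences are just the ones visible in the dump above):

raw = open('nonbillabletest2.csv', 'rb').read()
cleaned = raw.replace(b'\x86\xac\xff', b'').replace(b'\x91\x9d', b'')
# ...and so on for every sequence I happen to run into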

If there's an SQL solution, that would be fine too. Please help!


Update: I've removed the characters using string.printable as was suggested. I hit one more error because there was always one final empty row in the 'contents' section; adding a check for zero-length rows took care of that, however.
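
The check I added is essentially this (a sketch; only the empty-row test is new relative to the loop above):

for line in dataread:
    if len(line) == 0:
        continue  # skip the trailing empty row
    [CompanyName, Specifics] = line
    ...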

Thanks a lot for all your fast help!

Upvotes: 0

Views: 1803

Answers (1)

sjbrown

Reputation: 570

So nonbillabletest2.csv is not encoded in UTF-8.

You could:

  1. Fix it upstream. Ensure that it comes to you properly encoded as UTF-8, like you expect. This may be the "SQL solution" you refer to.
  2. Remove all the non-ASCII characters beforehand (which, for purists, corrupts the data, but from what you've said, it seems that's acceptable to you):

    import csv, string

    # read the raw bytes so Python doesn't attempt any decoding yet
    rfile = open('nonbillabletest2.csv', 'rb')
    rbytes = rfile.read()
    rfile.close()

    # keep only printable ASCII characters (string.printable already includes whitespace)
    contents = ''
    for b in rbytes:
        if chr(b) in string.printable:
            contents += chr(b)

    dataread = csv.reader(contents.split('\r\n'))
    ....
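
A third option, if simply dropping the undecodable bytes is acceptable, is to let Python's decoder do it for you via the errors argument to open() (a sketch, assuming the remaining data is plain ASCII; adjust the encoding if it isn't):

    import csv
    rfile = open('nonbillabletest2.csv', 'r', newline='', encoding='ascii', errors='ignore')
    dataread = csv.reader(rfile)  # the offending bytes are silently discarded during decoding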
    

Upvotes: 1

Related Questions