systemdebt

Reputation: 4941

Read csv with boxed question marks

I have a CSV file (in French) whose rows of text look like this:

"Vend, 21 sept, 2018","43326370894332743328177832888443325333815370","NX","651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto","RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux","Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline","Vendredi, 21 septembre, 2018","","","3","37089","","100","","204-7584","MIller ","claudia","8:30 pt ne s'est pas présenté (gastro) veut un autre rdv","370892192018","581-309-1309660-3064fille254-6560cel650-4556"

I read it in Python using the following code:

import csv
filepath = 'RDV.csv'
try:
    with open(filepath, 'rU') as file:
        try:
            reader = csv.reader(x.replace('\0', '') for x in file)
            for row in reader:
                try:
                    print(row)
                except Exception as ee:
                    print ee
        except Exception as eee:
            print eee
except Exception as e:
    print e

It is read like:

['Vend, 21 sept, 2018', '43326\x1d\x1d37089\x1d\x1d43327\x1d43328\x1d17783\x1d28884\x1d\x1d\x1d43325\x1d33381\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d5370', '\x1d\x1d\x1d\x1dNX', '651-2141\x1d\x1d652-1309\x1dNON\x1d666-3778\x1d692-2229\x1d581-300-6525\x1d622-9439\x1d\x1dNON\x1d581-998-8765\x1d827-3937\x1dSTOP\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dNON\x1d653-2541\x1d\x1d\x1dToronto', 'Roy\x1d\x1dRoy\x1d\x1dHoude\x1dOuellet\x1dFecteau\x1dRenaud\x1d\x1d\x1dBergeron\x1dLeclerc\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dBadeaux', 'Louise-Andr\x8ee\x1d\x1dAndr\x8e\x1d\x1dRichard\x1dAlexandra\x1dPauline\x1dEliane\x1d\x1d\x1dCharles-Eug\x8fne\x1dGuy\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1dJacqueline', 'Vendredi, 21 septembre, 2018', '', '', '3', '37089', '', '100', '', '204-7584', 'MIller ', 'claudia', "8:30 pt ne s'est pas pr\x8esent\x8e (gastro) veut un autre rdv\x0b", '370892192018', '\x1d\x1d581-309-1309\x1d\x1d\x1d\x1d660-3064fille\x1d254-6560cel\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d\x1d650-4556']
  1. How can it be read as plain text instead of those encoded characters?
  2. How can I look for the character � in values? For example:

    Louise-Andr�eAndr�RichardAlexandraPaulineElianeCharles-Eug�neGuyJacqueline

Edit:

I tried the code from snakecharmerb's answer, but I get the following error:

Traceback (most recent call last):
  File "<input>", line 20, in <module>
  File "<input>", line 9, in unicode_csv_reader
  File "<input>", line 15, in utf_8_encoder
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 701, in next
    return self.reader.next()
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 632, in next
    line = self.readline()
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 547, in readline
    data = self.read(readsize, firstline=True)
  File "/Users/simran/Documents/abc/venv/lib/python2.7/codecs.py", line 494, in read
    newchars, decodedbytes = self.decode(data, self.errors)
  File "/Users/simran/Documents/abc/venv/lib/python2.7/encodings/utf_16.py", line 112, in decode
    raise UnicodeError,"UTF-16 stream does not start with BOM"
UnicodeError: UTF-16 stream does not start with BOM

Upvotes: 0

Views: 2240

Answers (1)

snakecharmerb

Reputation: 55799

The file is probably encoded as UTF-16.

>>> s = '"Vend, 21 sept, 2018","43326370894332743328177832888443325333815370","NX","651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto","RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux","Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline","Vendredi, 21 septembre, 2018","","","3","37089","","100","","204-7584","MIller ","claudia","8:30 pt ne s\'est pas présenté (gastro) veut un autre rdv","370892192018","581-309-1309660-3064fille254-6560cel650-4556"'
>>> import csv, io
>>> buf = io.BytesIO(s.decode('utf-8').encode('utf-16'))
>>> next(csv.reader(buf))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: line contains NULL byte

Python 2's csv module doesn't handle UTF-16, and neither does the unicodecsv package. However, we can adapt the unicode_csv_reader from the examples in the docs:

import codecs
import csv 


def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]


def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')


with codecs.open('french2.csv', 'rU', encoding='utf-16') as f:
    for row in unicode_csv_reader(f):
        for cell in row:
            print cell

The code produces this output (one cell printed per line just to show the accented characters):

Vend, 21 sept, 2018
43326370894332743328177832888443325333815370
NX
651-2141652-1309NON666-3778692-2229581-300-6525622-9439NON581-998-8765827-3937STOPNON653-2541Toronto
RoyRoyHoudeOuelletFecteauRenaudBergeronLeclercBadeaux
Louise-AndréeAndréRichardAlexandraPaulineElianeCharles-EugèneGuyJacqueline
Vendredi, 21 septembre, 2018


3
37089

100

204-7584
MIller 
claudia
8:30 pt ne s'est pas présenté (gastro) veut un autre rdv
370892192018
581-309-1309660-3064fille254-6560cel650-4556

None of this would be necessary in Python 3, where you could just do:

with open(myfile, 'r', newline='', encoding='utf-16') as f:
    reader = csv.reader(f)
    for row in reader:
        ...
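
As a self-contained illustration of the Python 3 approach, the sketch below writes a small UTF-16 CSV to a temporary file and reads it back. The sample rows and the helper name read_utf16_csv are made up for the example, and it assumes the file starts with a BOM (which Python's 'utf-16' codec writes automatically).

```python
import csv
import os
import tempfile


def read_utf16_csv(path):
    """Return all rows of a UTF-16 encoded CSV file as lists of str."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, 'r', newline='', encoding='utf-16') as f:
        return list(csv.reader(f))


# Build a throwaway UTF-16 file to read back (sample data, not the real file).
sample_rows = [['Louise-Andrée', 'André'], ['Charles-Eugène', 'Guy']]
fd, tmp_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
try:
    with open(tmp_path, 'w', newline='', encoding='utf-16') as f:
        csv.writer(f).writerows(sample_rows)
    rows = read_utf16_csv(tmp_path)
finally:
    os.remove(tmp_path)

for row in rows:
    print(row)
```

If the real file has no BOM, the encoding argument would need to name the byte order explicitly ('utf-16-le' or 'utf-16-be').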

Review

Identifying the encoding

Guessing an unknown encoding is a problem without a general solution. In this case, we know that the encoded text contains null bytes, and that removing the null bytes leaves a hex escape where we expect to see an accented European character, while unaccented characters are unchanged. This is enough evidence to suggest that the file may be encoded as UTF-16: for characters in the ASCII range, UTF-16 encoding effectively prepends or appends a null byte to the ASCII character.

>>> u = u'André'
>>> s = u.encode('utf-16-le')
>>> s
'A\x00n\x00d\x00r\x00\xe9\x00'

UTF-16 encoding can be big-endian or little-endian; the endianness determines whether the null byte precedes or follows the ASCII character. The bytes may include a byte order mark (BOM) that indicates the endianness; in this case, the encoding may be specified as UTF-16 and Python will select the correct encoding. In the absence of a BOM, utf-16-le or utf-16-be must be specified explicitly.
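
One way to apply this rule is to inspect the first two bytes of the data before choosing a codec. The following is a minimal Python 3 sketch; the function name guess_utf16_codec and the little-endian default are assumptions made for illustration, and it only covers data already believed to be UTF-16.

```python
import codecs


def guess_utf16_codec(raw, default='utf-16-le'):
    """Pick a UTF-16 codec name from the first two bytes of raw."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # A BOM is present, so the generic codec can work out the byte
        # order (and strip the BOM) on its own.
        return 'utf-16'
    # No BOM: the byte order cannot be read from the data, so fall back
    # to the caller's choice (an assumption, not a detection).
    return default


with_bom = 'André'.encode('utf-16')        # Python prepends a BOM
without_bom = 'André'.encode('utf-16-le')  # no BOM is written

print(guess_utf16_codec(with_bom))
print(guess_utf16_codec(without_bom))
print(with_bom.decode(guess_utf16_codec(with_bom)))
print(without_bom.decode(guess_utf16_codec(without_bom)))
```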

The � character (u'\ufffd', U+FFFD) is the Unicode replacement character. It is substituted for bytes that cannot be decoded in the chosen encoding when the errors argument to decode is set to 'replace', and terminals commonly display it for bytes they cannot render.

>>> print s
Andr�
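
To look for the replacement character in values (the second part of the question), one option in Python 3 is to decode with errors='replace' and then test each cell for '\ufffd'. This is a minimal sketch; the sample bytes and the helper name cells_with_replacement are invented for illustration.

```python
REPLACEMENT = '\ufffd'  # U+FFFD REPLACEMENT CHARACTER


def cells_with_replacement(row):
    """Return the cells of row that contain U+FFFD."""
    return [cell for cell in row if REPLACEMENT in cell]


# b'\xe9' is not valid UTF-8 on its own, so decoding with errors='replace'
# substitutes U+FFFD for it.
decoded = b'Andr\xe9'.decode('utf-8', errors='replace')
row = ['Roy', decoded, 'Guy']

print(cells_with_replacement(row))
```

A cell flagged this way has already lost the original byte, so the check is useful for spotting a wrong encoding guess rather than for repairing the data.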

Reading the csv

Python 2's csv module does not handle non-ASCII encodings well. To get around its limitations, the code above

  • decodes the file contents from utf-16 to unicode
  • re-encodes as utf-8 (to avoid null bytes)
  • decodes each cell's contents from utf-8 to unicode

Once the content has been returned to the program as unicode it can be processed without issue, until it is encoded for writing to a file or printing.
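
The three steps can be traced on a single line of data. The sketch below uses Python 3 syntax and invented sample data, only to show why the intermediate UTF-8 form avoids the null bytes that Python 2's csv module rejects; it is not a replacement for the reader above.

```python
import csv

line = '"André","Guy"\r\n'
utf16_bytes = line.encode('utf-16-le')

# Step 1: decode the UTF-16 bytes to text.
text = utf16_bytes.decode('utf-16-le')

# Step 2: re-encode as UTF-8. ASCII characters become single bytes,
# so the null bytes disappear.
utf8_bytes = text.encode('utf-8')

# Step 3: get text cells back out. Python 3's csv module takes str input,
# so here the UTF-8 bytes are decoded before parsing; the Python 2 reader
# parses the bytes first and decodes each cell afterwards.
cells = next(csv.reader([utf8_bytes.decode('utf-8')]))

print(b'\x00' in utf16_bytes)
print(b'\x00' in utf8_bytes)
print(cells)
```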

In Python 3, handling non-ASCII text is much simpler; this code does all the work:

import csv

with open('french.csv', newline='', encoding='utf-16') as f:
    reader = csv.reader(f)
    # Iterate over the reader, not the file, to get parsed rows.
    for row in reader:
        print(row)

Upvotes: 1
