Reading/Decoding UTF-8 Escape Characters into Native Characters

Question

I am using the unicodecsv drop-in module for Python 2.7 to read a CSV file containing columns of words in 28 different languages, some of which are accented and/or use completely different alphabets or character systems. I am loading the CSV like this:

import unicodecsv as csv

with open(sourceFile, 'rU') as keywordCSV:
    keywordList = csv.reader(keywordCSV, encoding='utf-8-sig', dialect=csv.excel)

but reading from keywordList is currently producing unicode escape characters/sequences rather than the native character symbols. Whilst this is not ideal (ideally I would be able to load the unicode in the csv as native character symbols from the start), it is acceptable so long as I can convert these into native character symbols later on in the script (when exporting to whichever file type will make this easiest). How is this, or preferably the ideal case, done? I have tried using workarounds such as these to no avail, and I am still not sure if this is an interpreter issue or an encoding issue within the script.

The reason I have used utf-8-sig when reading the file is that not doing so was resulting in an error caused by the byte order mark (BOM):

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155:

but this has now stopped happening for reasons unbeknown to me. Similarly, I am using 'rU' when opening the file as not doing so produces a

_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

but I am not sure whether either of these is appropriate.

In this question, printing each character one by one results in the native characters being printed (something that also works in my code when run from the terminal). Is there a way of iterating through the characters and converting each one to its native character?

Apologies for posting another question on this already saturated topic, but I haven't been able to get other people's suggestions working for this case. Perhaps I have been looking in the wrong place in trying to decode the encoded csv output at the end of the script, and rather the problem is in my csv.reader's encoding. Any help will be very much appreciated.

Mark Tolonen · Accepted Answer

What you are seeing is the repr() of your Unicode strings. In Python 2.7, repr() displays only printable ASCII characters as-is; characters outside the ASCII range are displayed using escapes. This is for debugging purposes, so that non-printing characters or characters not supported by the current code page remain visible. If you want to see the characters rendered, print them, but note that printing a character not supported by the terminal's configured code page raises an error:

>>> s = u'\N{LATIN SMALL LETTER E WITH ACUTE}'
>>> s
u'\xe9'
>>> print repr(s)
u'\xe9'
>>> print s
é
>>> print unicode(s)
é
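
The same distinction usually explains escapes when looking at rows from a CSV reader: each row is a list, and printing a list shows the repr() of every element, whereas printing or joining the elements themselves shows the characters. A minimal sketch, where `row` is a made-up stand-in for one row from the reader and `show` is a hypothetical helper, not part of any library:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Hypothetical row, standing in for one row yielded by the CSV reader.
row = ['caf\xe9', 'na\xefve']


def show(text):
    """Print text, falling back to escapes if the terminal cannot encode it."""
    try:
        print(text)
    except UnicodeEncodeError:
        print(text.encode('ascii', 'backslashreplace').decode('ascii'))


# repr() of the list is what printing the list itself displays;
# on Python 2.7 the non-ASCII characters appear as escapes here.
show(repr(row))

# Joining the elements gives a single unicode string holding the real characters.
joined = ', '.join(row)
show(joined)
```

Nothing needs to be "converted": the strings already contain the characters, and only the way they are displayed differs.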

In the following case, the character isn't supported by the configured code page 437:

>>> s = u'\N{HORIZONTAL ELLIPSIS}'
>>> s
u'\u2026'
>>> print s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2026' in position 0: character maps to <undefined>
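
When the terminal cannot display a character, the data itself is still intact, and writing it to a file with an explicit encoding sidesteps the console's code page entirely. A minimal sketch, assuming the rows are lists of unicode strings; the `rows` values, the tab-separated layout, and the output file name are made up for illustration. `io.open` is available in Python 2.7 and accepts an `encoding` argument:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile

# Hypothetical rows, standing in for what the CSV reader yields.
rows = [['caf\xe9', '\u2026'], ['na\xefve', '\u65e5\u672c\u8a9e']]

# Illustrative output location in a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'keywords_out.txt')

# Write the unicode strings as UTF-8; no terminal code page is involved.
with io.open(out_path, 'w', encoding='utf-8') as out:
    for row in rows:
        out.write('\t'.join(row) + '\n')

# Read the file back to confirm the characters survived the round trip.
with io.open(out_path, 'r', encoding='utf-8') as check:
    round_tripped = [line.rstrip('\n').split('\t') for line in check]
```

Opening the resulting file in an editor that understands UTF-8 should show the native characters rather than escapes.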
