Python issues on character encoding

Question

I'm working on a program that needs to take two files, merge them, and write the merged result to a new file. The problem is that the output file contains escape sequences like \xf0 or, if I change some of the encodings, something like \u0028. The input files are encoded in UTF-8. How can I write characters like "è", "ò" and "-" to the output file?

This is my code:

import codecs
import pandas as pd
import numpy as np


# Raw strings keep the backslashes literal ("\f" would otherwise be a form feed).
goldstandard = r"..\files\file1.csv"
tweets = r"..\files\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="	",
                        names=['ID', 'Tweet'],
                        quoting=3)

IDs = tFile['ID']
tweets = tFile['Tweet']

dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]


with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("	")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()

And this is an example of the output:

   desired: Beyoncé
   obtained: Beyonc\xe9

Martijn Pieters · Accepted Answer

You are producing Python string literals, here:

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)

Pretty-printing is useful for producing debugging output; objects are passed through repr() to make non-printable and non-ASCII characters easily distinguishable and reproducible:

>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'

The é character is in the Latin-1 range, outside of the ASCII range, so it is represented with syntax that produces the same value again when used in Python code.

Don't use pprint if you want to write out actual string values to the output file. You'll have to do your own formatting in that case.
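As a minimal sketch of what "your own formatting" could look like: the snippet below joins the text values with tabs and writes them through io.open, which encodes to UTF-8 on the way out. The merged dictionary, the format_row helper and the temporary output path are made up for illustration; none of them come from the question.

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the merged dictionary.
merged = {1: [u"Beyonc\xe9", u"positive"], 2: [u"caff\xe8", u"neutral"]}


def format_row(key, values):
    # Join the actual text values with tabs instead of taking repr() of them.
    return u"\t".join([u"%d" % key] + values) + u"\n"


# Write to a throwaway location so the example does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")
with io.open(out_path, "w", encoding="utf8") as f:
    for key in sorted(merged):
        f.write(format_row(key, merged[key]))

# Read the file back as text to check what was written.
with io.open(out_path, "r", encoding="utf8") as f:
    content = f.read()
```

Because the values are written directly rather than through repr(), the file contains the characters themselves, not \xe9-style escapes.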

Moreover, the pandas dataframe will hold bytestrings, not unicode objects, so you still have undecoded UTF-8 data at that point.
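To illustrate the difference between undecoded UTF-8 data and decoded text, here is a small sketch. The byte string is a hand-written sample, not data from the question's files.

```python
# UTF-8 encoded bytes, as you would get from a file opened in binary mode.
raw = b"Beyonc\xc3\xa9"

# Decoding turns the two bytes \xc3\xa9 into the single character U+00E9.
text = raw.decode("utf8")

# Encoding again gives back exactly the same bytes.
round_trip = text.encode("utf8")

byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts characters
```

The round trip is lossless, which is why passing UTF-8 bytes straight from input to output without decoding can also work when every file uses the same encoding.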

Personally, I wouldn't even bother using pandas here. You appear to want to write CSV data, so I've simplified your code to use the csv module instead, and I'm not bothering to decode the UTF-8 (this is safe in this case because both input and output are entirely UTF-8):

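import csv

# Input paths from the question; raw strings keep the backslashes literal.
goldstandard = r"..\files\file1.csv"
tweets_file = r"..\files\file2.csv"

# Map each tweet ID to its text. Values stay UTF-8 byte strings (Python 2).
tweets = {}
with open(tweets_file, "rb") as t:
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweets[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for columns in reader:
        index = columns[0]
        # Tweet text first, then the selected gold-standard columns.
        writer.writerow([tweets[index]] + columns[1:4] + [columns[5]])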
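The code above targets Python 2, where the csv module works on byte strings. If you are on Python 3 (an assumption; the question does not say so), the csv module expects text, so a rough equivalent opens the files in text mode with an explicit encoding and newline="". The merge_files helper, the file names and the sample rows below are hypothetical and only there to make the sketch runnable.

```python
import csv
import os
import tempfile


def merge_files(tweets_path, gold_path, out_path):
    # Map tweet ID -> tweet text; text mode decodes the UTF-8 for us.
    tweets = {}
    with open(tweets_path, "r", encoding="utf-8", newline="") as t:
        for id_, tweet in csv.reader(t, delimiter="\t"):
            tweets[id_] = tweet

    with open(gold_path, "r", encoding="utf-8", newline="") as gs, \
            open(out_path, "w", encoding="utf-8", newline="") as outf:
        reader = csv.reader(gs, delimiter="\t")
        writer = csv.writer(outf, delimiter="\t")
        for columns in reader:
            # Tweet text first, then the selected gold-standard columns.
            writer.writerow([tweets[columns[0]]] + columns[1:4] + [columns[5]])


# Build two tiny sample inputs in a temporary directory.
workdir = tempfile.mkdtemp()
tweets_path = os.path.join(workdir, "tweets.tsv")
gold_path = os.path.join(workdir, "gold.tsv")
out_path = os.path.join(workdir, "out.csv")

with open(tweets_path, "w", encoding="utf-8", newline="") as f:
    f.write("1\tBeyonc\xe9 \xe8 qui\n")
with open(gold_path, "w", encoding="utf-8", newline="") as f:
    f.write("1\ta\tb\tc\td\te\n")

merge_files(tweets_path, gold_path, out_path)

# Read the merged file back to inspect the rows that were written.
with open(out_path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
```

Here the accented characters are decoded on read and encoded again on write, so the output file holds the real characters as UTF-8.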

Note that you really want to avoid using dict as a variable name, because it masks the built-in type; I used tweets instead.
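A short sketch of what that masking looks like in practice. The shadowing_demo function is a made-up name, and the shadowing is kept inside it so the built-in stays usable elsewhere.

```python
def shadowing_demo():
    dict = {}  # this local name now hides the built-in dict type
    try:
        # Looks like a call to the built-in, but calls the empty dictionary.
        dict(a=1)
    except TypeError as exc:
        return str(exc)
    return None


message = shadowing_demo()
```

Once the name is rebound, any later code in that scope that expects dict to be the type fails with a TypeError.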
