Reputation: 306
I'm working on a program that needs to take two files, merge them, and write the union to a new file. The problem is that the output file contains characters like \xf0, or, if I change some of the encodings, something like \u0028. The input files are encoded in UTF-8. How can I print characters like "è", "ò" and "-" to the output file?
I have written this code:
import codecs
import pandas as pd
import numpy as np

goldstandard = "..\\files\\file1.csv"
tweets = "..\\files\\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="\t",
                        names=['ID', 'Tweet'],
                        quoting=3)

IDs = tFile['ID']
tweets = tFile['Tweet']
dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]

with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("\t")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)

f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()
and this is an example of the output:
desired: Beyoncè
obtained: Beyonc\xe9
Upvotes: 2
Views: 232
Reputation: 1121914
You are producing Python string literals here:
import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
Pretty-printing is useful for producing debugging output; objects are passed through repr()
to make non-printable and non-ASCII characters easily distinguishable and reproducible:
>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'
The é character is in the Latin-1 range, outside of the ASCII range, so it is represented with syntax that produces the same value again when used in Python code.
Don't use pprint if you want to write out actual string values to the output file. You'll have to do your own formatting in that case.
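For example, here is a minimal sketch of that manual formatting, using a made-up rows dictionary in place of your merged data:

```python
import io

# Hypothetical stand-in for the merged dictionary from the question.
rows = {10: [u'Beyonc\xe9', u'positive']}

# Write the string values themselves, not their repr(), so the file
# ends up containing "Beyoncé" rather than the escape sequence \xe9.
with io.open("out.csv", "w", encoding="utf8") as outf:
    for key, values in rows.items():
        outf.write(u"\t".join([u"%d" % key] + values) + u"\n")
```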
Moreover, the pandas dataframe will hold bytestrings, not unicode objects, so you still have undecoded UTF-8 data at that point.
Personally, I'd not even bother using pandas here; you appear to want to write CSV data, so I've simplified your code to use the csv module instead. I'm not actually bothering to decode the UTF-8 here (this is safe for this case, as both input and output are entirely in UTF-8):
import csv

tweets = {}
with open("..\\files\\file2.csv", "rb") as t:
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweets[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for columns in reader:
        index = columns[0]
        writer.writerow([tweets[index]] + columns[1:4] + [columns[5]])
Note that you really want to avoid using dict as a variable name; it masks the built-in type. I used tweets instead.
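If you are on Python 3, the csv module works on text rather than bytes, so you would open the files with an explicit encoding instead of in binary mode. A sketch of the same merge (the merge function name and file paths are assumptions; the column handling matches the code above):

```python
import csv

def merge(tweets_path, gold_path, out_path):
    # Build the ID -> tweet mapping, decoding as UTF-8 on read.
    tweets = {}
    with open(tweets_path, encoding="utf8", newline="") as t:
        for id_, tweet in csv.reader(t, delimiter="\t"):
            tweets[id_] = tweet
    # Join each gold-standard row with its tweet and write UTF-8 out.
    with open(gold_path, encoding="utf8", newline="") as gs, \
         open(out_path, "w", encoding="utf8", newline="") as outf:
        writer = csv.writer(outf, delimiter="\t")
        for columns in csv.reader(gs, delimiter="\t"):
            writer.writerow([tweets[columns[0]]] + columns[1:4] + [columns[5]])
```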
Upvotes: 3