alexsc

Reputation: 1216

Python MySQL CSV export to json strange encoding

I received a CSV file exported from a MySQL database (I think the encoding is latin1, since the language is Spanish). Unfortunately the encoding is mangled and I cannot process it at all. If I run file on it:

$ file -I file.csv
file.csv: text/plain; charset=unknown-8bit

I have tried to read the file in python and convert it to utf-8 like:

r.decode('latin-1').encode("utf-8")

or using mysql_latin1_codec:

r.decode('mysql_latin1').encode('UTF-8')

I am trying to transform the data into json objects. The error comes when I save the file:

UnicodeEncodeError: 'ascii' codec can't encode characters in position

Do you know how I can convert it to normal UTF-8 characters? Or how I can convert the data to valid JSON? Thanks!!
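For reference, a minimal sketch of the intended pipeline in Python 3, assuming the CSV really is latin-1 (the file names and column names are hypothetical stand-ins; the sample write just makes the sketch self-contained):

```python
import csv
import json

# Fake a latin-1 export so the sketch runs on its own.
with open('file.csv', 'w', encoding='latin-1', newline='') as f:
    f.write('nombre,ciudad\nJosé,Bogotá\n')

# Read the CSV, decoding latin-1 as we go.
with open('file.csv', encoding='latin-1', newline='') as f:
    rows = list(csv.DictReader(f))

# ensure_ascii=False writes real accented characters instead of \uXXXX
# escapes; the output file must therefore be opened as UTF-8.
with open('file.json', 'w', encoding='utf-8') as f:
    json.dump(rows, f, ensure_ascii=False)
```

The UnicodeEncodeError above is the classic symptom of writing decoded text to a file without specifying an output encoding, so the fix is usually at the write step, not the read step.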

Upvotes: 8

Views: 960

Answers (3)

Beatriz Kanzki

Reputation: 69

I got really good results using a pandas DataFrame (pandas is from Continuum Analytics).

You could do something like:

import pandas as pd

# con must be a live database connection, e.g. a SQLAlchemy engine or a
# DBAPI connection built from your user, password, host and database.
data = pd.read_sql_query('SELECT * FROM YOUR_TABLE', con=con)

Then you could do:

data.to_csv('path_with_file_name')

or to convert to JSON:

data.to_json(orient='records')

or if you prefer to customize your json format see the documentation here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
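Since the question starts from an already-exported CSV rather than a live connection, pandas can also do the whole round trip from the file; the encoding argument below assumes the export really is latin-1, and the sample write only makes the sketch self-contained:

```python
import pandas as pd

# Fake a latin-1 CSV so the sketch runs on its own; names are hypothetical.
with open('file.csv', 'w', encoding='latin-1') as f:
    f.write('nombre\nJosé\n')

df = pd.read_csv('file.csv', encoding='latin-1')
# force_ascii=False keeps accented characters as literal UTF-8 bytes.
df.to_json('file.json', orient='records', force_ascii=False)
```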

Upvotes: 1

Rick James

Reputation: 142528

You probably have two problems. But let's back off... We can't tell whether the text was imported incorrectly, exported incorrectly, or merely displayed in a goofy way.

First, I am going to discuss "importing"...

Do not try to alter the encoding. Instead, work with whatever the encoding actually is. But first, figure out what it is: it could be latin1, it could be utf8, or any of several less likely charsets.

Find out the hex of the incoming file. In Python, code like this dumps the hex (and Unicode metadata) of each character of a string u:

import unicodedata

for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), unicodedata.name(c))

Tables listing the hex values for all the latin1 characters, together with their utf8 hex, are easy to find. For example, ó is latin1 F3 or utf8 C3B3.
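That table entry is easy to check from Python itself:

```python
# The same character, two encodings: latin-1 is a single byte F3,
# utf-8 is the two-byte sequence C3 B3.
latin1_hex = 'ó'.encode('latin-1').hex()
utf8_hex = 'ó'.encode('utf-8').hex()
```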

Now, armed with knowing the encoding, tell MySQL that.

LOAD DATA INFILE ...
    ...
    CHARACTER SET utf8  -- or latin1
    ...;

Meanwhile, it does not matter what CHARACTER SET ... the table or column is defined to be; MySQL will transcode if necessary. All Spanish characters are available in both latin1 and utf8.

Go to this Q&A.

I suggested that you have two errors: one is the "black diamond" case mentioned there; the other is something else. But... follow the "Best Practice" mentioned.

Back to your question of "exporting"...

Again, you need to check the hex of the output file. Again, it does not matter whether it is latin1 or utf8. However... if the hex is C383C2B3 where a plain ó should be, you have "double encoding". If you have that, check that you have removed any manual conversion function calls, and simply told MySQL what's what.
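If the hex check does show double encoding, the damage is mechanical and can usually be undone in Python (a sketch of the repair, not a substitute for fixing the pipeline):

```python
# C3 83 C2 B3 is the double-encoded form of 'ó': each UTF-8 byte of 'ó'
# (C3 and B3) was itself treated as latin-1 and re-encoded as UTF-8.
mojibake = b'\xc3\x83\xc2\xb3'

# Undo it: decode the outer UTF-8 layer, re-encode as latin-1 to get the
# original bytes back, then decode those as UTF-8.
repaired = mojibake.decode('utf-8').encode('latin-1').decode('utf-8')
```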

Here are some more utf8+Python tips you might need.

If you need more help, follow the text step-by-step. Show us the code used to move/convert it at each step, and show us the HEX at each step.

Upvotes: 0

Adam Henderson
Adam Henderson

Reputation: 308

Have you tried using the codecs module?:

import codecs
....
# Re-encode: r holds latin1 bytes; read them back out as utf-8 bytes.
codecs.EncodedFile(r, 'utf-8', 'latin1').read()

I remember having a similar issue a while back, and the answer had something to do with how encoding was handled prior to Python 3. The codecs module seems to handle this problem relatively elegantly.
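A fuller sketch of that approach, assuming r is a binary file object over the latin-1 data (io.BytesIO stands in for the real file here):

```python
import codecs
import io

# Stand-in for the latin-1 CSV opened in binary mode.
raw = io.BytesIO('José'.encode('latin-1'))

# EncodedFile(file, data_encoding, file_encoding): bytes read from `raw`
# are decoded as latin-1 and handed back re-encoded as UTF-8.
utf8_bytes = codecs.EncodedFile(raw, 'utf-8', 'latin-1').read()
```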

As coder mentioned in the question comments, it's difficult to pinpoint the problem without being able to reproduce it so I may be barking up the wrong tree.

Upvotes: 0
