RexSolus

Reputation: 143

(Python) Write string to CSV with UTF-8 Encoding

I have data in a database that I want to export to a CSV file. The data is in Japanese, and for further use it needs to be encoded in UTF-8.

This is my script to fetch the data and write it to the CSV file:

import mysql.connector
from mysql.connector import errorcode

import sys
import csv

query = 'SELECT * FROM `images-data`'

try:
    cnx = mysql.connector.connect(user='root', password='1234',
                                 host='127.0.0.1',
                                 database='sotsuken-test-db')
    cur = cnx.cursor()
    cur.execute(query)
    result = cur.fetchall()

    c = csv.writer(open("db-data.csv", "w"))
    for row in result:
        c.writerow(row)

except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your user name or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)
else:
    cnx.close()

I can create the CSV file and the data is exported in UTF-8, but the data in the CSV file looks like this:

1,b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe3\x81\xa7\xe3\x81\x94\xe3\x81\x8a\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'

In my research I found that my data is being written as a bytes object (maybe I'm wrong here). I need the data to be written in UTF-8 without the b''.

I know that I can use decode("UTF-8") to turn it into a string, but I cannot apply that to csv.writerow.

Can you give me any suggestions?

Upvotes: 3

Views: 2420

Answers (1)

abarnert

Reputation: 365737

csv.writer.writerow just takes a list of whatever you've got,[1] calls str on each element, and puts them together into a line of CSV.

So, if what you've got is bytes objects, then it's going to call str on them and get strings like b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe3\x81\xa7\xe3\x81\x94\xe3\x81\x8a\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'.
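You can see the same mechanism in isolation. This is just an illustrative sketch; the sample value is made up, not your actual column data:

>>> value = 'テスト'.encode('utf-8')   # roughly what a BLOB column comes back as
>>> print(str(value))                   # str() on bytes gives you its repr, b'...'
b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
>>> print(value.decode('utf-8'))        # decoding gives you back the actual text
テスト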

The solution is to give it strings.


The simplest way to do it is by decoding those bytes:

I know that I can use decode("UTF-8") to turn it into a string, but I cannot apply that to csv.writerow

Sure you can. I don't know what your columns are, so I'll make up my own example:

for image_id, image_name in result:
    c.writerow([image_id, image_name.decode('UTF-8')])

That's really all there is to it.
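One related thing worth checking, although it isn't what produces the b'...' text: open("db-data.csv", "w") uses your platform's default encoding, which is not necessarily UTF-8 (on Windows it often isn't). If the output really has to be UTF-8, it's safer to say so explicitly, and the csv docs recommend opening the file with newline='' when you hand it to csv.writer. A sketch, reusing the made-up two columns from above:

with open("db-data.csv", "w", encoding="utf-8", newline="") as csvfile:
    c = csv.writer(csvfile)
    for image_id, image_name in result:
        # decode the bytes column so writerow gets a real str
        c.writerow([image_id, image_name.decode('utf-8')])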


What if your code has absolutely no idea what the columns in the database are? Well, in that case, you have to get a little bit clever. As long as we know that all of the bytes columns are really Unicode text disguised as UTF-8, you can just switch on the type:

for row in result:
    # decode only the bytes columns; leave ints, strs, None, etc. alone
    textrow = [col.decode('UTF-8') if isinstance(col, bytes) else col for col in row]
    c.writerow(textrow)

That's a bit ugly, but reading * from a database where you have no idea what the columns are is inherently an ugly problem.


However, there's probably a much better solution.

MySQL Connector/Python turns all CHAR, TEXT, and similar columns into Unicode str values by default.[2] But BINARY, BLOB, and similar columns are always returned as bytes.

If this column is meant to represent Unicode text, set it to a text type instead of a binary one in the database. Then you won't have a problem in the first place—in this script, or in any other tools.
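For example, something along these lines would do it (a sketch only: the column name image_name and the utf8mb4 character set are assumptions, since the question doesn't show the schema):

# Assumed column name and charset; run once against the existing table.
# After this, Connector/Python returns the column as str, not bytes.
cur.execute(
    "ALTER TABLE `images-data` "
    "MODIFY `image_name` TEXT CHARACTER SET utf8mb4"
)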


[1] Actually, it appears to be illegal to call it with a list of anything but strings and numbers, according to the docs. But practically, it takes anything at all; it just doesn't do anything very useful with any types besides strings and numbers…

[2] That is, unless you explicitly pass use_unicode=False in the connect call or set it similarly elsewhere.
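In other words, with a connection like the one in the question, the default already gives you str for text columns; you'd only get bytes from text columns if that had been turned off. Purely for illustration (use_unicode=True is already the default, shown here only to make the setting visible):

cnx = mysql.connector.connect(user='root', password='1234',
                              host='127.0.0.1',
                              database='sotsuken-test-db',
                              use_unicode=True)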

Upvotes: 2
