personalt
personalt

Reputation: 850

Incorrect String Value for encoded characters - python insert in to my sql

I am parsing out a report form amazon, splits up lines into fields and then creates a mysql upload. Data I believe is originally iso-8859-1. Data uploads fine to mysql unless it has some special characters in it like an Ä or ®. If that happens I get an error like pymysql.err.InternalError: (1366, "Incorrect string value: '\\xAE Kids...' for column 'item-name' at row 74") & TypeError: can only concatenate str (not "bytearray") to str. I can hack around it by doing a replacing the bytes but I dont want to build a giant list plus I really want to store the proper values. I tried changing my mysql character sets and collations but that didnt seem to fix. I feel like the fix is a simple fix but i am already been trying things for a few hours.

report_as_dict = report.parsed
report_as_dict = report_as_dict.replace(b' \r\n', b'\r\n')  # remove black space at end

 multi_line_rebuild=list()
    for line in line_split[1:]:
        field_split = line.split(b'\t')
        logger.debug('Field Split : %s', field_split)
        field_split = [x.replace(b'\x92', b'') for x in field_split]  # removes single quotes
        field_split = [x.replace(b'\xA0', b'') for x in field_split]  # removes (
        field_split = [x.replace(b'\xAE', b'') for x in field_split]  # removes @
        field_split = [x.replace(b'\xCD', b'l') for x in field_split]  # replaces l with ' with l
        field_split = [x.replace(b'\xE4', b'a') for x in field_split]  # replaces a with two dots with a

        multi_line_rebuild.append(field_split)

 ....


 run_query_with_warnings(query_string, field_split=multi_line_rebuild)

Function

def run_query_with_warnings(warn_type, query_string, **kargs):

db = MySQLdb.connect(host=cred.host, user=cred.user, password=cred.password, db=cred.db, port=cred.port)
cursor = db.cursor()
cursor.executemany(query_string, kargs['field_split'])

Upvotes: 0

Views: 502

Answers (2)

Rick James
Rick James

Reputation: 142298

The client is working with latin1 encoding (92, etc). The table would like to have the utf8 encoding (E28099) for that "right single quotation mark". You can achieve that by telling MySQL that the client is using latin1 in the connection parameters, and having the column be utf8 (or utf8mb4).

The former is something like

db = MySQLdb.connect(host=DB_HOST, user=DB_USER, passwd=DB_PASS, db=DB_NAME,
              charset="utf8", use_unicode=True)

Also check whether or not you should change the beginning of your source to

# -*- coding: utf-8 -*-

But... I am worried. Are you really using right quote, registered sign (AE), I-acute, and a-double-dot? Or is this merely the beginning of some other mess? Sometimes multiple bytes in a row are 'bad'. To further analyze your situation, please get the hex for more than just one byte, and/or provide what characters you think the text should include.

Upvotes: 0

snakecharmerb
snakecharmerb

Reputation: 55669

The code is trying to write text encoded as ISO-8859-1 to tables set to expect UTF-8.

There are two solutions:

  • Set the charset argument on the connection to latin1 (this is the same as ISO-8859-1) and let the connection handle re-encoding the bytes to UTF-8

    db = MySQLdb.connect(host=cred.host, user=cred.user, password=cred.password, 
                         db=cred.db, port=cred.port, charset='latin1')
    
  • decode the encoded bytes to str and let the connection perform the encoding.

    report_as_dict = report.parsed.decode('ISO-8859-1')
    

If the code is doing nothing apart from writing the bytes directly to the database then the first option is fine; if the bytes are undergoing further manipulations then decoding to str will keep things simple.

Upvotes: 1

Related Questions