HenryAD
HenryAD

Reputation: 87

How to avoid b' and UTF-8 literals in MySQL using Python 3

I'm a bit of a novice. So I'm trying to parse html pages and put the contents in a column in MySQL, however I can't seem to get the actual foreign characters to appear, such as instead of á I get xc3xa1. My table has utf8mb4 as its character set and collation utf8mb4_unicode_ci. I have the following set-up:

Database_cnx = pymysql.connect(user='XXXX', password='XXXX',
                              host='XXXX',
                              database='XXXX',
                              use_unicode=True,
                              charset='utf8mb4')

article_content = str(row[3].encode("utf-8")).replace("'", "\'").replace("\"", "\'")

q_i = ("INSERT INTO article_items (" + ", ".join(article_table_col_name_new) + ")"
"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
)

So how can I have it that only the actual article content appears in my column, instead of the b' byte and the utf-8 literals. thanks

Upvotes: 1

Views: 231

Answers (2)

Rick James
Rick James

Reputation: 142366

instead of á I get xc3xa1

That says that the problem was on inserting the data into the database. Go back to that code and let's discuss it.

(That is probably \xc3\xa1, but the backslashes got lost somewhere.) C3A1 is the hex for the UTF-8 encoding of á.

You can double check things by getting the string length -- it should be 1 (characters) or 2 (bytes) for á, but is apparently 8 for \xc3\xa1.

Upvotes: 0

abarnert
abarnert

Reputation: 365915

The problem is that you're explicitly encoding your string into a UTF-8 bytes, and then turning that UTF-8 bytes into its string representation.

That's what this code means:

str(row[3].encode("utf-8"))

If you don't want to do that, just don't do that:

row[3]

Here's an example that shows what you're doing:

>>> s = 'à'
>>> s
'à'
>>> s.encode('utf-8')
b'\xc3\xa0'
>>> str(s.encode('utf-8'))
"b'\\xc3\\xa0'"

What you want here is the first one.

More generally, calling str on a bytes is almost never useful. If you unavoidably have a bytes and you need a str, you get it by calling the decode method. But in this case, you don't unavoidably have a bytes. (I mean, you could write row[3].encode("utf-8").decode("utf-8"), but that would obviously be pretty silly.)


As a side note—but a very important one—you should not be trying to str.format your values into the SQL string. Just use query parameters. Here's the obligatory xkcd link that explains the security/safety problem, and on top of that, you're making your code much more complicated, and even less efficient.

In other words, instead of doing this:

"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)

… just do this:

"VALUES (%s, %s, %s, %s, %s, %s, %s)"

And then, when you later execute the query, pass the arguments—without all that complicated converting to strings and quoting and replacing embedded quotes, just the values as-is—as the arguments to execute.

db.execute(q_i, (
    row[0], urlparse(row[i]).netloc, row[1], row[2], article_content, 
    datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated))

In fact, if your next to last column is—or could be—a DATETIME column rather than a CHAR/VARCHAR/TEXT/whatever, you don't even need that strftime; just pass the datetime object.

And notice that this means that you don't need to do anything at all to article_content. The quote stuff is neither necessary nor a good idea (unless you have some other, app-specific reason that you need to avoid " characters in articles), and the encoding stuff is not solving any problem, but only causing a new one.

Upvotes: 1

Related Questions