How to avoid b' and UTF-8 literals in MySQL using Python 3

Question

I'm a bit of a novice. So I'm trying to parse html pages and put the contents in a column in MySQL, however I can't seem to get the actual foreign characters to appear, such as instead of á I get xc3xa1. My table has utf8mb4 as its character set and collation utf8mb4_unicode_ci. I have the following set-up:

Database_cnx = pymysql.connect(user='XXXX', password='XXXX',
                              host='XXXX',
                              database='XXXX',
                              use_unicode=True,
                              charset='utf8mb4')

article_content = str(row[3].encode("utf-8")).replace("'", "\'").replace("\"", "\'")

q_i = ("INSERT INTO article_items (" + ", ".join(article_table_col_name_new) + ")"
"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
)

So how can I have it that only the actual article content appears in my column, instead of the b' byte and the utf-8 literals. thanks

abarnert · Accepted Answer

The problem is that you're explicitly encoding your string into a UTF-8 bytes, and then turning that UTF-8 bytes into its string representation.

That's what this code means:

str(row[3].encode("utf-8"))

If you don't want to do that, just don't do that:

row[3]

Here's an example that shows what you're doing:

>>> s = 'à'
>>> s
'à'
>>> s.encode('utf-8')
b'\xc3\xa0'
>>> str(s.encode('utf-8'))
"b'\xc3\xa0'"

What you want here is the first one.

More generally, calling str on a bytes is almost never useful. If you unavoidably have a bytes and you need a str, you get it by calling the decode method. But in this case, you don't unavoidably have a bytes. (I mean, you could write row[3].encode("utf-8").decode("utf-8"), but that would obviously be pretty silly.)

As a side note—but a very important one—you should not be trying to str.format your values into the SQL string. Just use query parameters. Here's the obligatory xkcd link that explains the security/safety problem, and on top of that, you're making your code much more complicated, and even less efficient.

In other words, instead of doing this:

"VALUES ({:d}, "{:s}", "{:s}", "{:s}", "{:s}", "{:s}", "{:s}")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace(""", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)

… just do this:

"VALUES (%s, %s, %s, %s, %s, %s, %s)"

And then, when you later execute the query, pass the arguments—without all that complicated converting to strings and quoting and replacing embedded quotes, just the values as-is—as the arguments to execute.

db.execute(q_i, (
    row[0], urlparse(row[i]).netloc, row[1], row[2], article_content, 
    datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated))

In fact, if your next to last column is—or could be—a DATETIME column rather than a CHAR/VARCHAR/TEXT/whatever, you don't even need that strftime; just pass the datetime object.

And notice that this means that you don't need to do anything at all to article_content. The quote stuff is neither necessary nor a good idea (unless you have some other, app-specific reason that you need to avoid " characters in articles), and the encoding stuff is not solving any problem, but only causing a new one.

How to avoid b' and UTF-8 literals in MySQL using Python 3

Answers (2)

Related Questions

How to avoid b&#39; and UTF-8 literals in MySQL using Python 3

Answers (2)

Related Questions

How to avoid b' and UTF-8 literals in MySQL using Python 3