Reputation: 87
I'm a bit of a novice. So I'm trying to parse html pages and put the contents in a column in MySQL, however I can't seem to get the actual foreign characters to appear, such as instead of á I get xc3xa1. My table has utf8mb4 as its character set and collation utf8mb4_unicode_ci. I have the following set-up:
Database_cnx = pymysql.connect(user='XXXX', password='XXXX',
host='XXXX',
database='XXXX',
use_unicode=True,
charset='utf8mb4')
article_content = str(row[3].encode("utf-8")).replace("'", "\'").replace("\"", "\'")
q_i = ("INSERT INTO article_items (" + ", ".join(article_table_col_name_new) + ")"
"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
)
So how can I have it that only the actual article content appears in my column, instead of the b' byte and the utf-8 literals. thanks
Upvotes: 1
Views: 231
Reputation: 142366
instead of á I get xc3xa1
That says that the problem was on inserting the data into the database. Go back to that code and let's discuss it.
(That is probably \xc3\xa1
, but the backslashes got lost somewhere.) C3A1
is the hex for the UTF-8 encoding of á
.
You can double check things by getting the string length -- it should be 1 (characters) or 2 (bytes) for á
, but is apparently 8 for \xc3\xa1
.
Upvotes: 0
Reputation: 365915
The problem is that you're explicitly encoding your string into a UTF-8 bytes, and then turning that UTF-8 bytes into its string representation.
That's what this code means:
str(row[3].encode("utf-8"))
If you don't want to do that, just don't do that:
row[3]
Here's an example that shows what you're doing:
>>> s = 'à'
>>> s
'à'
>>> s.encode('utf-8')
b'\xc3\xa0'
>>> str(s.encode('utf-8'))
"b'\\xc3\\xa0'"
What you want here is the first one.
More generally, calling str
on a bytes
is almost never useful. If you unavoidably have a bytes
and you need a str
, you get it by calling the decode
method. But in this case, you don't unavoidably have a bytes
. (I mean, you could write row[3].encode("utf-8").decode("utf-8")
, but that would obviously be pretty silly.)
As a side note—but a very important one—you should not be trying to str.format
your values into the SQL string. Just use query parameters. Here's the obligatory xkcd link that explains the security/safety problem, and on top of that, you're making your code much more complicated, and even less efficient.
In other words, instead of doing this:
"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
… just do this:
"VALUES (%s, %s, %s, %s, %s, %s, %s)"
And then, when you later execute the query, pass the arguments—without all that complicated converting to strings and quoting and replacing embedded quotes, just the values as-is—as the arguments to execute
.
db.execute(q_i, (
row[0], urlparse(row[i]).netloc, row[1], row[2], article_content,
datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated))
In fact, if your next to last column is—or could be—a DATETIME
column rather than a CHAR
/VARCHAR
/TEXT
/whatever, you don't even need that strftime
; just pass the datetime
object.
And notice that this means that you don't need to do anything at all to article_content
. The quote stuff is neither necessary nor a good idea (unless you have some other, app-specific reason that you need to avoid "
characters in articles), and the encoding stuff is not solving any problem, but only causing a new one.
Upvotes: 1