Reputation: 2140
I have run into this problem a few times now while working on projects, and I would like to know if there's an elegant solution.
Problem
I am pulling tweets via XML from Twitter and uploading them to my DB. However, when I output them to the screen I get characters like these:
"moved to dusseldorf.�" OR también
and if I have Russian characters then I get lots of ugly boxes in their place.
What I would like is for the correct native accents to show under one encoding. I thought this was possible with UTF-8.
What I am using
After reading in the XML file I am doing the following to cleanse the data:
$data = trim($data);
$data = htmlentities($data);
$data = mysql_real_escape_string($data);
My Database Collation is: utf8_general_ci
Web page character set is: charset=UTF-8
I think it could have something to do with HTML entities, but I would really appreciate a solution that works across the board on projects.
Thanks in advance.
Upvotes: 3
Views: 4313
Reputation: 168655
You can't use htmlentities() in its default state for XML data, because this function produces HTML entities, not XML entities.
The difference is that the HTML DTD defines a bunch of entity codes which web browsers are programmed to interpret. But most XML DTDs don't define them (if the XML even has a DTD).
The only entity codes that are available by default in XML are &lt;, &gt;, &amp;, &quot; and &apos;. All other characters need to be presented using their numeric entity.
PHP doesn't have an xmlentities() function, but if you read the manual page for htmlentities(), you'll see in the comments that plenty of people have had this same issue and have posted their solutions. After a quick browse through it, I'd suggest looking at the one named philsXMLClean().
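As a rough illustration of the numeric-entity approach, here is a minimal sketch. The name xml_entities() is hypothetical: it is not a PHP built-in and it is not the philsXMLClean() function from the manual comments. It assumes the input is valid UTF-8 and that the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: not part of PHP and not philsXMLClean().
// Assumes $text is valid UTF-8 and the mbstring extension is available.
function xml_entities(string $text): string
{
    // Escape the characters that are special in XML markup: & < > " '
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    // Convert every non-ASCII code point to a decimal numeric reference.
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo xml_entities("caf\u{E9} & <b>"), "\n";
```

The markup characters are escaped first so that the ampersands introduced by the numeric references are not escaped a second time.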
Hope that helps.
Upvotes: 0
Reputation: 476960
Make sure that you set PHP's internal encoding to UTF-8 using iconv_set_encoding, and that you call htmlentities with the encoding information as EdoDodo said. Also make sure that your database stores its data as UTF-8, though you say that's already the case.
Upvotes: 0
Reputation: 8334
Replace this line:
$data = htmlentities($data);
With this:
$data = htmlentities($data, ENT_COMPAT, "UTF-8");
That way, htmlentities() treats the input as UTF-8 and converts each multi-byte character as a whole instead of mangling it byte by byte. For more information see the documentation for htmlentities().
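To see why the encoding argument matters, here is a small sketch that runs the same UTF-8 string through htmlentities() twice. The variable names are only illustrative. PHP versions before 5.4 defaulted to ISO-8859-1, which would explain the garbled output in the question if the default was in effect.

```php
<?php
// "también" written with explicit UTF-8 bytes (é is 0xC3 0xA9), so the
// result does not depend on how this file itself is saved.
$word = "tambi\xC3\xA9n";

// Declared as ISO-8859-1: each of the two bytes is converted on its own.
$wrong = htmlentities($word, ENT_COMPAT, 'ISO-8859-1');

// Declared as UTF-8: the two bytes are converted as one character.
$right = htmlentities($word, ENT_COMPAT, 'UTF-8');

echo $wrong, "\n"; // tambi&Atilde;&copy;n
echo $right, "\n"; // tambi&eacute;n
```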
Upvotes: 6
Reputation: 449395
You need to change your connection's encoding to UTF-8 (it's usually iso-8859-1). See here: How can I store the '€' symbol in MySQL using PHP?
Calling htmlentities() is unnecessary when you get the encodings right. I would remove it completely. You'll just have to be careful to use htmlspecialchars() when outputting the data in an HTML context.
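A minimal sketch of that setup, under some assumptions: it uses mysqli instead of the mysql_* functions from the question, the helper names open_utf8_connection() and h() are hypothetical, and the connection parameters are placeholders to be supplied by the caller.

```php
<?php
// Hypothetical helper; host, user, password and database are placeholders.
// Uses mysqli rather than the question's mysql_* functions.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $conn = new mysqli($host, $user, $pass, $db);
    // Tell MySQL that this connection sends and expects UTF-8.
    // 'utf8' matches the utf8_general_ci collation from the question.
    $conn->set_charset('utf8');
    return $conn;
}

// Hypothetical output helper: escape only the HTML-special characters and
// leave accented or Cyrillic characters as plain UTF-8.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Only the output helper is exercised here; no database is needed for it.
echo h("tambi\u{E9}n <b>"), "\n";
```

With the data stored as raw UTF-8, escaping happens once, at output time, and the page's charset=UTF-8 declaration takes care of displaying the accents.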
Upvotes: 2