mmackh
mmackh

Reputation: 3630

Character encoding error, cannot write valid XML from MySQL via PHP

The feed in question is: http://api.inoads.com/snowstorm/feed.xml

Here is the PHP code I am using for the generation:

<?php

$database =  'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);

$query = "SELECT * FROM the_queue WHERE id LIKE '%'    ORDER BY id DESC LIMIT 25";
$result = mysql_query($query, $dbconnect);

while ($line = mysql_fetch_assoc($result))
        {
            $return[] = $line;
        }

$now = date("D, d M Y H:i:s T");

$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
            <rss version=\"2.0\">
                <channel>
                    <title>The Queue</title>
                    <link>http://readapp.net</link>
                    <description>A curated reading list.</description>
                    <language>en-us</language>
                    <pubDate>$now</pubDate>
                    <lastBuildDate>$now</lastBuildDate>
            ";

foreach ($return as $line)
{
    $output .= "<item><title>".htmlspecialchars($line['title'])."</title>
    <description>".htmlspecialchars($line['description'])."</description>
                    <link>".htmlspecialchars($line['link'])."</link>
                    <pubDate>".htmlspecialchars($line['pubDate'])."</pubDate>
                </item>";
}
$output .= "</channel></rss>";

$fh = fopen('feed.xml', 'w');
fwrite($fh, $output);
?>

What might be causing the error?

Here's a link from the feed validator: http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fapi.inoads.com%2Fsnowstorm%2Ffeed.xml

Upvotes: 1

Views: 4676

Answers (5)

Abhi Beckert
Abhi Beckert

Reputation: 33329

You said the XML file is UTF-8, but when I download it and open it in my text editor it auto-detects the windows latin1 encoding, and the quotes display perfectly.

If I force my text editor to use UTF-8, it shows an error message because there are illegal characters for the UTF-8 encoding.

Therefore, your data is not UTF-8, it is latin1. You need to find out exactly where that's happening. It could be any one, or several of:

is the HTML page where the content is typed in by the user set to UTF-8?

If not, the browser will be sending latin1 quotes. To fix this, the first tag in your <head> needs to be:

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  ...
</head>

is every browser correctly respecting your UTF-8 setting in that page's HTML?

If you specify UTF-8 and the page contains characters illegal in that encoding, some browsers might decide to use a different encoding despite the <meta> tag. How to check this is different in every browser.

is the MySQL connection when inserting into the database set to use UTF-8?

You need to be using UTF-8 here, or else MySQL may try to convert the encoding for you, often corrupting them. Set the encoding with:

$database =  'xxxx';
$dbconnect = mysql_pconnect('xxxx', 'xxxx', 'xxxx');
mysql_select_db($database, $dbconnect);
mysql_query('SET NAMES utf8', $dbconnect);

is the MySQL table (and individual column) set to use UTF-8?

Again, to avoid MySQL doing it's own buggy conversion, you need to make sure it's using UTF-8 for the table and also the individual comment. Do a structure dump of the database and check for:

CREATE TABLE `the_queue` (
  ...
) ... DEFAULT CHARSET=utf8;

And also make sure there isn't something like this on any of the columns:

`description` varchar(255) CHARACTER SET latin1,

is the MySQL connection when reading the database set to use UTF-8?

Your read connection also needs to be utf8. So double check that.

are you doing anything in the PHP that cannot handle UTF-8?

PHP has some functions which cannot be used on utf-8 strings, as it will corrupt them. One of those functions is htmlentities() so make sure you always use htmlspecialchars(). The easiest way to test this is to start commenting out big chunks of your code to see where the encoding is breaking.

Upvotes: 3

outis
outis

Reputation: 77400

The point of htmlentities is to replace all characters that have define HTML character entities with those entities. If you really don't want any character entities (as your desired result suggests), don't use htmlentities.

By default, htmlentities uses the latin-1 charset, so it chokes on the smart quotes (indeed, all multibyte characters), which is where you see the question marks. One fix is to use htmlspecialchars to convert a much more limited set of characters (&, <, >, ' and "). This will still convert the double quotes because, well, that's the point of htmlspecialchars, unless you specify the ENT_NOQUOTES as the second argument. Another fix is to specify the character set as the third argument (this isn't exclusive of using htmlspecialchars).

The fourth argument to either specifies whether or not to encode already encoded characters. Whether or not do double-encode depends on the source data.

$line['description'] = '"Dave, stop. Stop, will you? Stop, Dave. Will you stop, Dave?” ... “Dave, my mind is going,” HAL says, forlornly. “I can feel it. I can feel it.”';

echo "<description>" . htmlspecialchars($line['description'], ENT_NOQUOTES, 'UTF-8', false) . "</description>";

See also:

Upvotes: 1

jap1968
jap1968

Reputation: 7763

There is one problem here:

$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
...

There is a string containing "?>". This is the finalization marker for php. It will give you an error.

You can avoid these problems this way:

$output = "<?xml version=\"1.0\" encoding=\"UTF-8\"?".">
...

Upvotes: 1

josegil
josegil

Reputation: 365

Another error that you have it´s the format of the date. The date must be in format RFC-822, it must be in a format like this "Wed, 02 Oct 2002 08:00:00 EST", not "July/August 2008".

Upvotes: 0

bkowalikpl
bkowalikpl

Reputation: 827

Problem is that you are holding this string with quotes in database (as I assume). If it is true, PHP is removing quotes (which is proper), because of not causing bugs (SQL injection ex). So you have to remove quotes in DB and while generating XML file just add them. It is the simplest in my opinion. And try avoid double quotes ". You should use single ones '. In double PHP parser additionally checks what is in. So try to remove qoutes from DB and add them while generating XML. Should help.

Upvotes: 0

Related Questions