EvilCow

Reputation: 31

PHP DOMDocument nodeValue returns different encoding

When parsing an HTML document with DOMDocument, I get a different encoding from nodeValue depending on where the script runs. In my dev environment I get UTF-8, but after uploading the script to the web server I get ISO-8859-1.

Can anyone explain this behaviour, and how do I get the same encoding in both places?

<?php
header('Content-Type:text/html; charset=UTF-8');
$strHtml = file_get_contents("http://www.aftonbladet.se/senastenytt/ttnyheter/inrikes/article13397806.ab");

$objDOM= new DOMDocument();
@$objDOM->loadHTML($strHtml);
echo "Encoding: ". $objDOM->encoding."<br/>";

//Parse heading from DOMDocument
$strNodeValue = ''; // stays empty if the page has no <h1>
$objNodelist = $objDOM->getElementsByTagName('h1');
foreach ($objNodelist as $objElem)
{
    $strNodeValue = $objElem->nodeValue; // text of the first <h1> only
    break;
}
echo 'nodeValue: "'.$strNodeValue.'"<br/>';
echo 'utf8_decode: "'.utf8_decode($strNodeValue).'"<br/>';
echo 'utf8_encode: "'.utf8_encode($strNodeValue).'"<br/>';

//Parse heading using substring from html
$strHeading = substr($strHtml , strpos($strHtml, '<h1 class="abS32">')+18, strpos($strHtml, '</h1>') - strpos($strHtml, '<h1 class="abS32">')-18);
echo 'Heading from substring: "'.$strHeading.'"';
?>

Output when run in development environment
Encoding: utf-8
nodeValue: "När semestern inleds vankas åska"
utf8_decode: "N�r semestern inleds vankas �ska"
utf8_encode: "När semestern inleds vankas åska"
Heading from substring: "När semestern inleds vankas åska"

Output when run on public web server
Encoding: utf-8
nodeValue: "När semestern inleds vankas åska"
utf8_decode: "När semestern inleds vankas åska"
utf8_encode: "När semestern inleds vankas ÃÂ¥ska"
Heading from substring: "När semestern inleds vankas åska"

Apparently utf8_decode needs to be used on the public web server, but not in my dev environment. I would like to have the same behaviour on both systems. Any ideas?

Upvotes: 3

Views: 975

Answers (2)

EvilCowX

Reputation: 11

The problem was solved by updating PHP on the web hotel server.

Old configuration on web hotel:
PHP Version: 5.2.6-1+lenny13
libxml Version: 2.6.32

Updated configuration on web hotel:
PHP Version 5.3.3-7+squeeze3
libxml Version 2.7.8
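
To compare two environments, the versions listed above can also be read from within a script. This is a minimal sketch, assuming the libxml extension is loaded; the helper name describeXmlEnvironment is made up for illustration.

```php
<?php
// Hypothetical helper: collects the two versions that differed in this case.
function describeXmlEnvironment()
{
    return array(
        'php'    => PHP_VERSION,           // e.g. "5.3.3-7+squeeze3"
        'libxml' => LIBXML_DOTTED_VERSION, // e.g. "2.7.8"
    );
}

foreach (describeXmlEnvironment() as $name => $version) {
    echo $name . ' version: ' . $version . "\n";
}
```

Running this on both servers shows quickly whether they differ in PHP or libxml version.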

The script now generates the same output in both environments:
Encoding: utf-8
nodeValue: "När semestern inleds vankas åska"
utf8_decode: "När semestern inleds vankas åska"
utf8_encode: "När semestern inleds vankas ÃÂ¥ska"
Heading from substring: "När semestern inleds vankas åska"
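
If updating is not an option, a commonly used workaround is to tell the parser the input encoding explicitly before calling loadHTML. The sketch below assumes the fetched HTML really is UTF-8 and prepends a meta tag declaring that; the helper name firstHeadingUtf8 is hypothetical, and I have not verified this against the old libxml 2.6.32.

```php
<?php
// Hypothetical helper: parse HTML known to be UTF-8, return the first <h1> text.
function firstHeadingUtf8($strHtml)
{
    $objDOM = new DOMDocument();
    // The prepended meta tag declares the charset so the HTML parser
    // does not have to guess it. Warnings are silenced with @, as in
    // the question's script.
    $strMeta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    @$objDOM->loadHTML($strMeta . $strHtml);

    $objNodelist = $objDOM->getElementsByTagName('h1');
    return $objNodelist->length > 0 ? $objNodelist->item(0)->nodeValue : null;
}

echo firstHeadingUtf8('<html><body><h1>När semestern inleds vankas åska</h1></body></html>');
```

The returned string should then be UTF-8 on both systems, so neither utf8_decode nor utf8_encode is needed before output.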

Upvotes: 1

Ian

Reputation: 2021

I can think of two possible reasons for this behaviour.

First, take a look at default_charset in the two php.ini files. I suspect you will find that one sets it to "iso-8859-1" and the other to "UTF-8".
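
A quick way to compare the setting without opening php.ini is to print it from a script on each server. This is only a sketch; the variable name is my own.

```php
<?php
// Read the effective default_charset setting; an empty string means it is not set.
$strCharset = ini_get('default_charset');
echo 'default_charset: "' . $strCharset . '"' . "\n";
```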

Second, check the code used to connect from PHP to your database, and the database connection defaults. These might also be different.

You can use the following code to switch a MySQL connection to UTF-8.

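// mysql_set_charset() is only available from PHP 5.2.3, so test for it
// directly instead of comparing version strings.
if (function_exists('mysql_set_charset')) {
    $result = mysql_set_charset('utf8');
} else {
    $result = mysql_query("SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'");
}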

Upvotes: 0
