Prakash

Reputation: 2723

Detect Encoding and Convert Everything to UTF-8 with PHP

I want to extract various data from URLs and convert it to UTF-8, no matter what encoding the original page uses (or at least have it work for most source encodings).

So, after reading through many discussions and answers, I finally came up with the following code, in which I parse the HTML data twice (once to detect the encoding and a second time to get the actual data). It works on all the URLs I have checked, but I think the code is poorly written.

Can anyone let me know whether there are better alternatives for doing the same thing, or whether this code needs any improvements?

<?php
header('Content-Type: text/html; charset=utf-8');
require_once 'curl.php';
require_once 'curl_response.php';

$curl = new Curl;

$url = "http://" . $_GET['domain'];
$curl_response = $curl->get($url);
$header_content_type = $curl_response->headers['Content-Type'];

$dom_doc = new DOMDocument();

// First pass: parse as UTF-8 just to read the charset hints from <meta>.
libxml_use_internal_errors(TRUE);
$dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $curl_response);
libxml_use_internal_errors(FALSE);

// Initialise, so the detection below never reads an undefined variable.
$meta_content_type = '';
$html5_charset = '';

$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if (strtolower($meta->getAttribute('http-equiv')) == 'content-type') {
        $meta_content_type = $meta->getAttribute('content');
    }
    if ($meta->getAttribute('charset') != '') {
        $html5_charset = $meta->getAttribute('charset');
    }
}

// Detection priority: HTTP header, <meta http-equiv>, HTML5 <meta charset>,
// then an XML declaration in the body; otherwise fall back to the default.
if (preg_match('/charset=(.+)/i', $header_content_type, $m)) {
    $charset = trim($m[1]);
} elseif (preg_match('/charset=(.+)/i', $meta_content_type, $m)) {
    $charset = trim($m[1]);
} elseif (!empty($html5_charset)) {
    $charset = $html5_charset;
} elseif (preg_match('/encoding=(.+)/i', $curl_response, $m)) {
    $charset = $m[1];
} else {
    // browser default charset
    $charset = 'ISO-8859-1';
}

// Second pass: re-parse only when the source is not already UTF-8.
if (strtolower($charset) != 'utf-8') {
    $tmp = iconv($charset, 'utf-8', $curl_response);
    libxml_use_internal_errors(TRUE);
    $dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $tmp);
    libxml_use_internal_errors(FALSE);
}

$title_nodes = $dom_doc->getElementsByTagName('title');
$page_title = $title_nodes->length ? $title_nodes->item(0)->nodeValue : '';

$meta_description = '';
$meta_tags = '';
$metas = $dom_doc->getElementsByTagName('meta');
foreach ($metas as $meta) {
    if (strtolower($meta->getAttribute('name')) == 'description') {
        $meta_description = $meta->getAttribute('content');
    }
    if (strtolower($meta->getAttribute('name')) == 'keywords') {
        $meta_tags = $meta->getAttribute('content');
    }
}

print $charset;
print "<hr>";

print $page_title;
print "<hr>";

print $meta_description;
print "<hr>";

print $meta_tags;
print "<hr>";

print "Memory Peak Usages: " . memory_get_peak_usage()/1024/1024 . " MB";
?>

Upvotes: 1

Views: 982

Answers (2)

S&#233;bastien Renauld
S&#233;bastien Renauld

Reputation: 19662

Your question is too open-ended, and I've voted to close it. However, I will still provide a stub of an answer that will, hopefully, point you in the right direction.

At the moment, you are trusting user-defined input (the page's own headers and meta tags) for the charset. This is a very, very, very bad move, for various reasons:

  • Most webmasters on small sites will just send header("Content-type: text/html; charset=utf-8") because they've heard it is good practice, without actually encoding their output as UTF-8. Not taking this into account will lead to mangled UTF-8 output.
  • Some webmasters do the opposite: they do not set a header, and their webserver sends ISO-8859-1 headers despite the page being UTF-8 encoded. On the rendered page this is invisible - but it matters to DOMDocument (I've had this issue recently).
  • Double UTF-8 encoding via iconv is never fun.

I'd strongly advise using a utility that decodes UTF-8 until no characters remain within the UTF-8 extended range, and then encodes exactly once, rather than relying on iconv or the multibyte encoding functions. The reason is simple: they can get it wrong. You can also set an error handler to parse DOMDocument errors, in order to catch and redirect the loadXML "failed due to malformed XML" errors, which may have nothing to do with your character encoding at all. Basically, the key to your problem is to not blindly do stuff.
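A minimal sketch of what I mean by that, assuming the real text fits within the ISO-8859-1 range (mb_convert_encoding replaces anything outside it with a placeholder) - force_single_utf8 is a hypothetical name, not an existing function:

    function force_single_utf8($s) {
        // While the string still parses as UTF-8 *and* contains non-ASCII
        // bytes, peel off one layer of encoding (UTF-8 -> ISO-8859-1).
        // Double-encoded input loses exactly one layer per pass.
        while (preg_match('/[\x80-\xFF]/', $s) && mb_check_encoding($s, 'UTF-8')) {
            $s = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
        }
        // The string is now plain single-byte text: encode exactly once.
        return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
    }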

If you'd like a good target to test your UTF-8 handling against, parse the home page of Google Play. It sends out malformed replies (which is what initially forced me into the decode-until-nothing-is-in-the-extended-range approach). It will also show you that DOMDocument can fail for a wide variety of reasons - not just charset - and that you need to follow the errors to deal with them.
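Following the errors is not shown above either; a minimal sketch of reading the parser errors instead of discarding them, using the stock libxml functions ($raw_html stands in for the fetched body):

    libxml_use_internal_errors(true);
    $dom_doc = new DOMDocument();
    $dom_doc->loadHTML($raw_html); // $raw_html: assumed to hold the fetched page
    foreach (libxml_get_errors() as $error) {
        // $error->code lets you separate encoding complaints from plain
        // bad markup, so each class can be handled differently.
        printf("line %d [code %d]: %s\n", $error->line, $error->code, trim($error->message));
    }
    libxml_clear_errors();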

Other performance pointers outside of that big encoding snafu include:

  • Splitting your code into functions that return their results. You've got a lot of repetition in there - learn to use functions to stop explicitly writing the same core logic multiple times.
  • This:

    if (preg_match('/charset=(.+)/', $header_content_type, $m)) { $charset = $m[1]; } elseif (preg_match('/charset=(.+)/', $meta_content_type, $m)) {

    is horrible. You can easily replace it with a strpos call, which will speed this particular set of ifs up by about 5-10x (see the sketch after this list).
  • $metas = $dom_doc->getElementsByTagName('meta'); - you're aware that DOMDocument walks your entire DOM when you use this method, right? Consider restricting the query to just the head tag, which is always the first child of html, the document element (XPath: /html/head).
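Putting the two together, a rough sketch might look like this (charset_from is a made-up helper name, not part of the original code):

    // strpos/substr instead of a regex: grab whatever follows "charset=".
    function charset_from($value) {
        $pos = strpos($value, 'charset=');
        return $pos === false ? null : substr($value, $pos + strlen('charset='));
    }

    $charset = charset_from($header_content_type);
    if ($charset === null) {
        $charset = charset_from($meta_content_type);
    }

    // Restrict the meta scan to <head> instead of walking the whole DOM.
    $xpath = new DOMXPath($dom_doc);
    foreach ($xpath->query('/html/head/meta') as $meta) {
        // ... same attribute checks as in the question ...
    }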

Upvotes: 1

Frank

Reputation: 91

In regard to performance, you should call unset() when you're done with variables or values, even if you're going to reassign them later - but not if you still need the value further down your script. PHP will not hand the memory back to the operating system, but it will reuse the preallocated memory released by unset() for future allocations.
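A minimal sketch against the question's code (assuming the raw response is not needed once the second parse is done):

    // After the second loadHTML() the raw body and the iconv copy are
    // dead weight; unset() releases them for reuse within this request.
    $dom_doc->loadHTML('<?xml encoding="utf-8" ?>' . $tmp);
    unset($tmp, $curl_response);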

Another thing you could do is take big chunks of that code and split them into functions that return their results. Remember that a function's local variables and memory are released automatically when it returns, unless you're working with global variables.

Those will help performance and memory utilization.

Upvotes: 0
