user431806

Reputation: 416

PHP - simple_html_dom: does the crawler encode innerHTML?

I'm using PHP Simple HTML DOM (simple_html_dom).

The target site uses UTF-8. My PHP script and the stream context are both set to use UTF-8.

An element (which I inspect in the browser) has an innerHTML of "AAA ' BBB", at least as rendered by Firefox and Chrome.

However, my PHP script always fetches this string as "AAA &#039; BBB". I can fix this using htmlspecialchars_decode($string, 1), but I really want to know why the PHP script, or rather the website, is ("wrongly") encoding the string in the first place when it is visited by my PHP script, which is explicitly set to UTF-8.

header('Content-Type: text/html; charset=utf-8');
define("CONTEXT", stream_context_create(
    array(
        "http" => array(
            "header" => 'Content-Type: text/html; charset=utf-8'
            // also tried 'header' => 'Accept-Charset: UTF-8'
        )
    )
));

The target site is served as UTF-8 - http://mtggoldfish.com.cutercounter.com/

$html = file_get_html($url, false, CONTEXT);

// do things; every "'" comes out as the encoded &#039;

Upvotes: 0

Views: 42

Answers (1)

Joffrey Schmitz

Reputation: 2438

Browser inspectors show the parsed DOM, with character references already decoded, so that you get something human-readable.

Create a simple HTML page with only AAA &#039; BBB in the body, and you will see AAA ' BBB in the inspector.
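
The same effect can be reproduced without a browser. The sketch below is only an illustration, not part of simple_html_dom: it assumes PHP's bundled DOM extension is available, and the one-line page in $source is a made-up example. It parses the raw markup and reads the text back, the way an inspector presents it.

```php
<?php
// Hypothetical minimal page: the raw source contains the entity &#039;,
// not the literal apostrophe.
$source = '<html><body>AAA &#039; BBB</body></html>';

// Parse the markup with the bundled DOM extension (assumed to be installed).
$doc = new DOMDocument();
$doc->loadHTML($source);

// The parsed tree holds the decoded character, much like a browser inspector.
$body = $doc->getElementsByTagName('body')->item(0);
$parsedText = trim($body->textContent);

echo "raw source:  ", $source, PHP_EOL;
echo "parsed text: ", $parsedText, PHP_EOL;
```

The raw string still carries the entity, while the parsed text has the plain apostrophe, so neither side is encoding anything "wrongly".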

If you really want to see the content of the page, look at the source code (which is what file_get_html gets).
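
If the decoded text is what you need in the script, one option is to decode the entities yourself after fetching. This is a minimal sketch using PHP's built-in html_entity_decode; the $raw value stands in for whatever string the crawler returned and is an assumption for illustration.

```php
<?php
// Stand-in for the string fetched from the page source (assumed value).
$raw = 'AAA &#039; BBB';

// ENT_QUOTES makes the decoder handle both single- and double-quote entities;
// the third argument names the target character set.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

Unlike htmlspecialchars_decode, which only handles a handful of special characters, html_entity_decode also covers other named and numeric entities, which may matter if the page contains more than apostrophes.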

Upvotes: 1
