Alex

Reputation: 33

Can't decode html entities in title

I am having trouble decoding the entities in the title of this YouTube video:

http://www.youtube.com/watch?v=p7NMsywVQhY

Here is my code:

$url = 'http://www.youtube.com/watch?v=p7NMsywVQhY';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);

$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;

//decode the '&#x202a;' in the title
$title = html_entity_decode($title,ENT_QUOTES,'UTF-8'); //does not seem to have any effect
//decode the utf data
$title = utf8_decode($title);

$title comes back fine, except that question marks appear where &#x202a; was in the original title.

Thanks.

Upvotes: 2

Views: 1575

Answers (2)

Mike

Reputation: 340

Try this to force correct detection of the charset:

$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;

echo $title;
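For anyone who wants to try this without fetching the page, here is a minimal sketch of the same idea. The sample markup and the extractTitle helper are made up for illustration and are not from the question. The point it tries to show is that nodeValue already comes back as UTF-8 with the character references decoded, so no html_entity_decode call is needed afterwards. Since U+202A has no ISO-8859-1 equivalent, passing the result through utf8_decode is the likely source of the question marks.

```php
<?php
// Hypothetical sample markup standing in for the fetched page (an assumption;
// the real YouTube HTML is not reproduced here).
$html = '<html><head><title>Caf&#xe9; &#x202a;demo</title></head><body></body></html>';

// Hypothetical helper: returns the <title> text as UTF-8, or null if absent.
function extractTitle(string $html): ?string
{
    $doc = new DOMDocument();
    // Prepending an XML encoding declaration nudges libxml into reading the
    // markup as UTF-8 instead of guessing the charset.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $node = $doc->getElementsByTagName('title')->item(0);
    // nodeValue is already a UTF-8 string with character references decoded.
    return $node === null ? null : $node->nodeValue;
}

$title = extractTitle($html);

// A hex dump makes the otherwise invisible U+202A visible in the output.
echo bin2hex($title), "\n";
```

If the page you print $title on is served as UTF-8, the string can be echoed as-is.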

Upvotes: 0

MatTheCat

Reputation: 18721

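I don't know whether PHP provides a function for that; however, you can use preg_replace_callback like this:

// The /e modifier was removed in PHP 7, and chr() only covers code points below 256; mb_chr needs PHP 7.2+ with mbstring.
$string = preg_replace_callback('/&#x([0-9a-f]+);/i', function ($m) { return mb_chr(hexdec($m[1]), 'UTF-8'); }, $string);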
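As far as I can tell, the regex is not strictly needed: html_entity_decode also decodes numeric character references when given a UTF-8 target charset. The sketch below uses a made-up input string (an assumption, not the real title) that still contains raw entities. In the question, the call probably appeared to do nothing because DOMDocument had already decoded the entities by the time nodeValue was read.

```php
<?php
// Hypothetical raw string that still contains undecoded entities.
$raw = 'Title &#x202a;with&#x202c; entities &amp; more';

// html_entity_decode handles named entities and numeric references alike,
// and emits UTF-8 when the third argument says so.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// Prints bool(true) when no hexadecimal references are left in the string.
var_dump(strpos($decoded, '&#x') === false);
```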

Upvotes: 1
