James

Reputation: 1842

&nbsp removal in PHP

I need to remove all dodgy HTML characters from a website I'm parsing using cURL and Simple HTML DOM.

<?php
$html = "this is&nbsp;a text";
var_dump($html);
var_dump(html_entity_decode($html,ENT_COMPAT,"UTF-8"));

Which outputs

string(19) "this is&nbsp;a text"

string(15) "this is a text"

I don't want to use the preg_* functions, as there are other entities in the text (e.g. &deg;). This is driving me insane now!

Thanks, James

Upvotes: 1

Views: 1414

Answers (2)

Overv

Reputation: 8529

You need to specify your output encoding with a header:

<?php
    header('Content-Type: text/html; charset=utf-8');

    $html = "this is&nbsp;a text";
    var_dump($html);
    var_dump(html_entity_decode($html,ENT_COMPAT,"UTF-8"));
?>

The browser does not assume UTF-8 by default, which is why it displays the wrong character.
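If the goal is a plain space rather than a correctly displayed non-breaking space, something along these lines might help. It is only a sketch: the function name decode_and_normalize_spaces is made up for illustration, and it assumes the input is valid UTF-8. After decoding, &nbsp; becomes U+00A0, which is the two bytes C2 A0 in UTF-8, so it can be swapped for an ordinary space without any preg_* call while other entities such as &deg; are still decoded normally.

```php
<?php
// Hypothetical helper (not from the question): decode all entities,
// then turn the UTF-8 no-break space into an ordinary space.
function decode_and_normalize_spaces(string $html): string
{
    // Named entities such as &nbsp; and &deg; become UTF-8 characters.
    $decoded = html_entity_decode($html, ENT_COMPAT, "UTF-8");

    // U+00A0 is encoded as the bytes C2 A0 in UTF-8. str_replace works
    // on bytes, and this pair cannot occur inside another valid
    // UTF-8 sequence, so the replacement is safe.
    return str_replace("\xC2\xA0", " ", $decoded);
}

var_dump(decode_and_normalize_spaces("this is&nbsp;a text at 20&deg;C"));
```

The header from the snippet above is still needed if you print the result in a browser, since the degree sign is also a multi-byte UTF-8 character.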

Upvotes: 4

John Conde

Reputation: 219804

If that's the only character that needs replacing, just use str_replace():

var_dump(str_replace('&nbsp;', ' ', "this is&nbsp;a text"));
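Since the question mentions other entities such as &deg;, the same call can probably be combined with html_entity_decode(): replace &nbsp; first, then decode whatever is left. A rough sketch, where the function name nbsp_to_space_then_decode is hypothetical and only the named form &nbsp; is handled (numeric forms like &#160; are not covered here):

```php
<?php
// Hypothetical helper: replace the named entity first, then decode the rest.
function nbsp_to_space_then_decode(string $html): string
{
    // Turn the literal entity into an ordinary space before decoding.
    $withSpaces = str_replace('&nbsp;', ' ', $html);

    // Remaining entities (e.g. &deg;) are decoded to UTF-8 characters.
    return html_entity_decode($withSpaces, ENT_COMPAT, 'UTF-8');
}

var_dump(nbsp_to_space_then_decode("this is&nbsp;a text at 20&deg;C"));
```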


Upvotes: 1
