Reputation: 7585

Issue parsing html file with php

I have some code that parses an HTML file, and I stumbled across a page containing this character, which broke the parsing: “

When I execute the following code, $len is assigned a value of 3.

$test = "“";
$len = strlen($test);

I suspect that this character might be Unicode.

For now I'm getting around the problem by replacing the curly double quote with a standard double quote. However, I'm concerned about other files that might contain similar characters, and I don't want to write a replace call for each separate case.

How do I get php to treat this as a single character?

Upvotes: 2

Answers (4)

Pekka

Reputation: 449555

PHP's standard string handling functions are not multibyte-aware; they stupidly count the number of bytes in the string. In UTF-8, the curly quote “ (U+201C) is encoded as three bytes, which is why strlen() returns 3.

If you have the multibyte extension installed, mb_strlen() is what you are looking for.

For example, if your data is UTF-8:

$test = "“";
$len = mb_strlen($test, "UTF-8");
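
To make the difference visible, here is a minimal sketch that prints the raw bytes next to both counts. It assumes PHP 7 or later (for the "\u{...}" escape, used so the result does not depend on how the source file is saved) and that the mbstring extension is loaded.

```php
<?php
// Assumes PHP 7+ (for the \u{...} escape) and the mbstring extension.
// U+201C is the left curly double quote from the question.
$test = "\u{201C}";

$hex   = bin2hex($test);             // raw UTF-8 bytes as hex
$bytes = strlen($test);              // counts bytes
$chars = mb_strlen($test, "UTF-8");  // counts characters

echo $hex, "\n";    // e2809c
echo $bytes, "\n";  // 3
echo $chars, "\n";  // 1
```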

Upvotes: 1

morgar

Reputation: 2407

You need to use the multibyte versions of the string functions: http://php.net/manual/en/function.mb-strlen.php

Upvotes: 1

Kaivosukeltaja

Reputation: 15735

Use mb_strlen(), it will handle multibyte characters.
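
If passing the encoding on every call gets tedious, one option is to set the internal encoding once and then omit the argument. This is only a sketch, assuming the mbstring extension is loaded and that all of your input really is UTF-8.

```php
<?php
// Assumes the mbstring extension; PHP 7+ for the \u{...} escape.
// Set once, for example at the top of the script.
mb_internal_encoding("UTF-8");

$test = "\u{201C}";        // the curly quote from the question
$len  = mb_strlen($test);  // no encoding argument: the internal encoding is used

echo $len, "\n";           // 1
```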

Upvotes: 1

Marcin

Reputation: 1615

For Unicode, use the PHP functions whose names start with mb_ (multibyte). For example: http://php.net/manual/en/function.mb-strlen.php
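
The same applies beyond strlen(). As a rough illustration, the sketch below compares substr()/strpos() with their mb_ counterparts; $html is a made-up sample string, and PHP 7+ with the mbstring extension is assumed.

```php
<?php
// Assumes PHP 7+ and the mbstring extension.
// $html is a hypothetical sample: "Hello" wrapped in curly quotes (U+201C, U+201D).
$html = "\u{201C}Hello\u{201D}";

// Byte-based functions cut and count in bytes.
$firstByte = substr($html, 0, 1);   // one byte of a three-byte character
$bytePos   = strpos($html, "Hello"); // byte offset

// Multibyte functions cut and count in characters.
$firstChar = mb_substr($html, 0, 1, "UTF-8");        // the whole opening quote
$charPos   = mb_strpos($html, "Hello", 0, "UTF-8");  // character offset

echo $bytePos, "\n";  // 3
echo $charPos, "\n";  // 1
// The single byte on its own is not valid UTF-8.
var_dump(mb_check_encoding($firstByte, "UTF-8")); // bool(false)
```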

Upvotes: 1
