Novice

Reputation: 1011

How to get the length of a string containing character references while counting the character references as one single character?

How can I get the length of a string that also contains character references? I want to count only the characters that will be displayed in the browser. For example:

$raw = "Stack&#00f9";        // Length = 6
$raw = "Stack12345";         // Length = 10
$raw = "Stack&#00f9&#00f9";  // Length = 7

Thanks in advance

Upvotes: 0

Views: 1335

Answers (5)

Paul Dixon

Reputation: 300825

As your strings contain literal encodings of Unicode characters (rather than being, say, UTF-8 encoded), you could obtain the length by simply replacing each one with a dummy character, thus:

$length=strlen(preg_replace('/&#[0-9a-f]{4}/', '_', $raw));

If they were encoded with something PHP understands, like UTF-8, you could use mb_strlen() instead.
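As a quick check of the replacement approach, here is a minimal sketch that runs it over the three strings from the question. The function name `displayLength` is made up for illustration, the `i` flag is added so uppercase hex digits also match, and the pattern assumes every reference has exactly the `&#` plus four hex digits form shown in the question.

```php
<?php
// Hypothetical helper: count each "&#hhhh" sequence as one character.
// Assumes the rest of the text is single-byte (ASCII), as in the question.
function displayLength(string $raw): int
{
    // Collapse every pseudo-reference to a single placeholder character.
    $collapsed = preg_replace('/&#[0-9a-f]{4}/i', '_', $raw);
    return strlen($collapsed);
}

echo displayLength("Stack&#00f9"), "\n";        // 6
echo displayLength("Stack12345"), "\n";         // 10
echo displayLength("Stack&#00f9&#00f9"), "\n";  // 7
```

If the surrounding text can itself contain multi-byte characters, swapping `strlen` for `mb_strlen($collapsed, 'UTF-8')` would be the safer choice.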

Upvotes: 2

Gumbo

Reputation: 655169

strlen is a single-byte string function: it returns the number of bytes, which equals the number of characters only when every character occupies one byte. On multi-byte strings it therefore gives the wrong count.

For multi-byte strings use strlen’s multi-byte counterpart mb_strlen instead and don’t forget to specify the proper character encoding.

And to have each HTML character reference counted as a single character, use html_entity_decode to replace the references with the characters they represent:

$str = html_entity_decode('Stack&#x00f9;', ENT_QUOTES, 'UTF-8');
var_dump(mb_strlen($str, 'UTF-8'));  // int(6)

Note that &#00f9 is not a valid character reference: it is missing an x or X after &# for the hexadecimal notation and a ; after the hexadecimal value.
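To apply this to the strings exactly as they appear in the question, the malformed references would first have to be rewritten into valid ones. The following is only a sketch under that assumption: `decodedLength` is a hypothetical name, the pattern assumes every reference is `&#` followed by exactly four hex digits, and the mbstring extension must be available.

```php
<?php
// Hypothetical helper: rewrite the question's "&#hhhh" form into a valid
// hexadecimal reference "&#xhhhh;", then decode it and count characters.
// Assumes every reference in the input uses exactly that four-digit form.
function decodedLength(string $raw): int
{
    // Insert the missing "x" and the trailing ";".
    $valid = preg_replace('/&#([0-9a-f]{4})/i', '&#x$1;', $raw);
    // Turn each reference into the UTF-8 character it stands for.
    $decoded = html_entity_decode($valid, ENT_QUOTES, 'UTF-8');
    // Count characters, not bytes.
    return mb_strlen($decoded, 'UTF-8');
}

echo decodedLength("Stack&#00f9"), "\n";        // 6
echo decodedLength("Stack12345"), "\n";         // 10
echo decodedLength("Stack&#00f9&#00f9"), "\n";  // 7
```

Input that already mixes in valid decimal or hexadecimal references would need a stricter pattern than this one.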

Upvotes: 1

Victor Nicollet

Reputation: 24577

I would go with:

$len = mb_strlen(html_entity_decode($myString, ENT_QUOTES, 'UTF-8'), 'UTF-8');

Although I would first question why you have HTML entities inside your strings, as opposed to manipulating actual UTF-8 encoded strings.

Also, note that your HTML entities are not written correctly (they need to end with a semicolon). If you do not add the semicolon, any entity-related functions will fail, and many browsers will fail to render your entities correctly.
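The effect of the missing semicolon can be seen in a small sketch. The variable names are illustrative only, and the references here already use the hexadecimal `&#x` prefix so that the semicolon is the only difference between the two inputs.

```php
<?php
// With the trailing semicolon the reference is decoded to one character.
$withSemicolon = html_entity_decode('Stack&#xf9;', ENT_QUOTES, 'UTF-8');

// Without it, html_entity_decode leaves the text untouched.
$withoutSemicolon = html_entity_decode('Stack&#xf9', ENT_QUOTES, 'UTF-8');

echo mb_strlen($withSemicolon, 'UTF-8'), "\n";     // 6
echo mb_strlen($withoutSemicolon, 'UTF-8'), "\n";  // 10
```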

Upvotes: 3

Alex Pliutau

Reputation: 21957

mb_strlen('string' , 'UTF-8');

Upvotes: -1

Benjamin Cremer

Reputation: 4822

Have a look at mb_strlen

Upvotes: -1
