How can I detect, or correctly identify the length, of strange characters?

Question

I am inserting soft hyphens into long words programmatically, and am having problems with unusual characters, specifically: ■

Any word over 10 characters gets the soft hyphen treatment. Words are defined with a regex: [A-Za-z0-9,.]+ (to include long numbers). If I split a string containing two of the above Unicode characters with that regex, I get a 'word' like this: ■■

My script then goes through each word, measures the length (mb_strlen($word, 'UTF-8')), and if it is over an arbitrary number of characters, loops through the letters and inserts soft hyphens all over the place (every third character, not in the last five characters).

With the ■■, the word length is coming out as high enough to trigger the replacement (10). So soft hyphens are inserted, but they are inserted within the characters. So what I get out is something like:

��■

In the database, these ■ characters are being stored (in a json_encoded block) as "\u2002", so I can see where the string length is coming from. What I need is a way to identify these characters, so I can avoid adding soft hyphens to words that contain them. Any ideas, anyone?

(Either that, or a way to measure the length of a string, counting these as single characters, and then a way to split that string into characters without splitting it part-way through a multi-byte character.)

bobince · Accepted Answer

With the same caveats as listed in the comments about guessing without seeing the code:

mb_strlen($word, 'UTF-8'), and if it is over an arbitrary number of characters, loops through the letters

I suspect you are actually looping through bytes. This is what will happen if you use array-access notation on a string.

When you are using a multibyte encoding like UTF-8, a letter (or more generally ‘character’) may take up more than one byte of storage. If you insert or delete in the middle of a byte sequence you will get mangled results.
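To make the byte/character difference concrete, here is a small sketch. It assumes the mbstring extension is loaded and uses U+25A0 (the ■ shown in the question) as a stand-in for whatever character you actually have; the variable names are just for illustration.

```php
<?php
// U+25A0 BLACK SQUARE, spelled out as its three UTF-8 bytes so that the
// encoding of this source file does not matter.
$square = "\xE2\x96\xA0";
$word = $square . $square;

$byteCount = strlen($word);              // counts bytes: 6
$charCount = mb_strlen($word, 'UTF-8');  // counts characters: 2

// Array access on a string returns a single byte, not a character.
$firstByte = $word[0];

// Inserting a soft hyphen at byte offset 1 lands inside the first
// character and leaves an invalid UTF-8 sequence behind.
$mangled = substr($word, 0, 1) . "\xC2\xAD" . substr($word, 1);
$stillValid = mb_check_encoding($mangled, 'UTF-8');

printf("%d bytes, %d characters\n", $byteCount, $charCount);
var_dump($stillValid);
```

If your length check reports 2 but your loop runs over 6 positions, this is the mismatch you are seeing.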

This is why you must use mb_strlen and not plain old strlen. Some languages have a native Unicode string type where each item is a character, but in PHP strings are completely byte-based, and if you want to interact with them in a character-by-character way you must use the mbstring functions. In particular, to read a single character from a string you use mb_substr, and you'd loop your index from 0 up to (but not including) mb_strlen.
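A character-based loop along those lines might look like the sketch below. I haven't seen your code, so the function name and its $step/$tail parameters are made up; they just mirror the "every third character, not in the last five" rule from the question.

```php
<?php
const SHY = "\xC2\xAD"; // U+00AD SOFT HYPHEN encoded as UTF-8

// Hypothetical helper: adds a soft hyphen after every $step characters,
// but never within the last $tail characters of the word.
function hyphenateByCharacter(string $word, int $step = 3, int $tail = 5): string
{
    $length = mb_strlen($word, 'UTF-8');
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        // mb_substr works in characters, so a multibyte character is
        // always copied whole.
        $out .= mb_substr($word, $i, 1, 'UTF-8');
        $done = $i + 1;
        if ($done % $step === 0 && $done <= $length - $tail) {
            $out .= SHY;
        }
    }
    return $out;
}

$result = hyphenateByCharacter('abcdefghijkl');
```

Because both the length and the index are counted in characters, the hyphen can only ever land between two complete characters.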

It would probably be simpler to take the matched word and use a regular expression replacement to insert the soft hyphen after each run of three characters. You can get multibyte string support for regex by using the u flag. (This only works for UTF-8, but UTF-8 is the only multibyte encoding you'd ever actually want to use.)

const SHY= "\xC2\xAD"; // U+00AD Soft Hyphen encoded as UTF-8
// '$0' is the whole match; \B stops a hyphen being added at the very end of the word
$wrappableword= preg_replace('/.{3}\B/u', '$0'.SHY, $longword);
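Applied to a whole string, that could be wrapped up roughly as follows. This is only a sketch: softHyphenateText and the word pattern [\p{L}\p{N},.]+ are my assumptions (a Unicode-aware variant of the regex in the question), and the threshold of 10 is taken from the question.

```php
<?php
const SHY = "\xC2\xAD"; // U+00AD SOFT HYPHEN encoded as UTF-8

// Hypothetical wrapper: finds word-like runs and soft-hyphenates those
// longer than $minLength characters, leaving everything else untouched.
function softHyphenateText(string $text, int $minLength = 10): string
{
    return preg_replace_callback(
        '/[\p{L}\p{N},.]+/u', // assumed word definition: letters, digits, comma, dot
        function (array $m) use ($minLength): string {
            $word = $m[0];
            if (mb_strlen($word, 'UTF-8') <= $minLength) {
                return $word;
            }
            // Soft hyphen after every three characters, except at the end.
            return preg_replace('/.{3}\B/u', '$0' . SHY, $word);
        },
        $text
    );
}

$out = softHyphenateText("short \xE2\x96\xA0\xE2\x96\xA0 internationalization");
```

Symbols such as ■ are not letters or digits, so they never become part of a word here and are passed through as they are. Note that this follows the regex above rather than the "not in the last five characters" rule from the question.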
