Hipny

Reputation: 749

PHP - Substring after X characters with special-characters

Sorry for the title, I really didn't know how to say this...

I often have a string that needs to be cut after X characters. My problem is that this string often contains special characters encoded as HTML entities, like &egrave;

So I'm wondering: is there a way to know in PHP, without transforming my string, whether I am cutting it in the middle of a special character?

Example

This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact

so right now my result with a sub string would be :

This is my string with a special char : &egra

but I want to have something like this :

This is my string with a special char : &egrave;

Upvotes: 4

Views: 4227

Answers (6)

Francis Avila

Reputation: 31621

The best thing to do here is store your string as UTF-8 without any html entities, and use the mb_* family of functions with utf8 as the encoding.

But, if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mb_string library:

$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');

However, if your underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as html entities". This is an example of where this can go wrong:

// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === 'é'
// should be 'é '

When your string is in a multibyte encoding, you must instead convert all html entities to a common encoding before you split. E.g.:

// html_entity_decode() documents 'UTF-8' as the charset name, so use that spelling
$strings_actual_encoding = 'UTF-8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);
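
As a rough sketch, the decode-then-truncate step above could be wrapped in a helper. The name truncate_entities is hypothetical, and the sketch assumes the input is UTF-8 text containing HTML 4 entities; re-encoding at the end is optional and only there to get entity-escaped output back.

```php
<?php
// Hypothetical helper: truncate text containing HTML entities without
// cutting an entity in half. Assumes $s is UTF-8.
function truncate_entities(string $s, int $length): string
{
    // Turn every entity into the character it stands for.
    $decoded = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
    // Count and cut in characters, not bytes.
    $cut = mb_substr($decoded, 0, $length, 'UTF-8');
    // Re-encode so the result is entity-escaped again
    // (ENT_QUOTES also turns literal quotes into entities).
    return htmlentities($cut, ENT_QUOTES, 'UTF-8');
}

$s = 'This is my string with a special char : &egrave; - and more';
echo truncate_entities($s, 41), "\n";
```

Here the entity counts as a single character, so a length of 41 keeps the whole &egrave; and a length of 40 stops just before it.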

Upvotes: 7

Vyktor

Reputation: 20997

A somewhat brute-force solution, which I'm not really happy with, would be a PCRE expression. Let's say that you want to keep 80 characters and the longest possible HTML entity is 7 characters long:

$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return a bit of shorter text
return preg_replace($regex, '$1', $text);

Just so you know:

  • .{73} - 73 characters
  • [^&]{7} - okay, we may fill it with anything that doesn't contain &
  • .{0,7}$ - keep in mind the possible end (this shouldn't be necessary because shorter text wouldn't match at all)
  • [^&]{0,6}&[^;]+; - up to 6 characters (you'd be at 79th), then & and let it finish

Something that seems much better, but requires a bit of play with the numbers, is to:

// Nothing to cut if $text is already at most $N bytes long
if (strlen($text) <= $N) {
    return $text;
}

// Look only at the part we intend to keep (the first $N bytes)
$head = substr($text, 0, $N);

// Position of the last & before the cut
$pos = strrpos($head, '&');

// No entity at all before the cut
if ($pos === false) {
    return $head;
}

// Position of the last ; before the cut
$end = strrpos($head, ';');

// Entity already closed (; comes after &)
if ($end !== false && $end > $pos) {
    return $head;
}

// The entity is still open at the cut: find the first ; at or after $N
$end = strpos($text, ';', $N);
if ($end === false) {
    // Not valid HTML (unclosed entity): fall back to a plain cut
    return $head;
}

// Keep everything up to and including the closing ;
return substr($text, 0, $end + 1);

Check numbers, there may be +/-1 somewhere in indexes...

Upvotes: 2

0b10011

Reputation: 18785

The best solution would be to store your text as UTF-8 instead of as HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of 8), then the following snippet should work:

<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
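// ENT_NOQUOTES is what the flags value NULL used to mean; passing NULL is deprecated in PHP 8.1+
$cut_string = htmlentities(mb_substr(html_entity_decode($string, ENT_NOQUOTES, 'UTF-8'), 0, 45, 'UTF-8'), ENT_NOQUOTES, 'UTF-8')."<br><br>";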

Note: If you encode the text with a different function (e.g. htmlspecialchars()), use that function instead of htmlentities(). If you encode it with a custom function, replace html_entity_decode() with a custom function that does the opposite, and replace htmlentities() with your custom encoder.

Upvotes: 4

Mike Brant

Reputation: 71384

You can use html_entity_decode() first to decode all the HTML entities. Then split your string. Then htmlentities() to re-encode the entities.

$decoded_string = html_entity_decode($original_string);
// implement logic to split string here

// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
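
One possible way to fill in the "split" step, as a sketch only: it assumes UTF-8 input, PHP 7.4+ (for mb_str_split()), and an arbitrary chunk size of 10 characters. The sample string and the $encoded_parts name are made up for illustration.

```php
<?php
// Sketch of decode -> split -> re-encode. The sample text and the
// 10-character chunk size are assumptions for illustration.
$original_string = 'caf&eacute; cr&egrave;me br&ucirc;l&eacute;e';

// Decode first so every entity becomes a single character.
$decoded_string = html_entity_decode($original_string, ENT_QUOTES, 'UTF-8');

$encoded_parts = [];
// Split on character boundaries, then re-encode each part.
foreach (mb_str_split($decoded_string, 10, 'UTF-8') as $split_string_part) {
    $encoded_parts[] = htmlentities($split_string_part, ENT_QUOTES, 'UTF-8');
}

print_r($encoded_parts);
```

Because the split happens on the decoded text, no part can start or end inside an entity, and joining the parts gives back the original string.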

Upvotes: 3

ashastral

Reputation: 2848

The longest HTML 4 entity (&thetasym;) is 10 characters long, including the ampersand and semicolon. If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.

However, if you're willing to preprocess the string, Mike's solution will be more accurate because his cuts the string at X characters, not bytes.
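
The byte-based check could look roughly like the sketch below. The function name cut_keeping_entity and the $maxLen parameter are hypothetical, and the code assumes a single-byte encoding such as ASCII or ISO-8859-1 with no entity longer than $maxLen bytes. It also adds one assumption not in the description: the ";" must sit within $maxLen bytes of the "&", so a bare ampersand does not drag the cut far forward.

```php
<?php
// Hypothetical helper for the byte-based check described above.
// Assumes a single-byte encoding and that no entity is longer than
// $maxLen bytes, counting "&" and ";".
function cut_keeping_entity(string $s, int $x, int $maxLen = 10): string
{
    if (strlen($s) <= $x) {
        return $s; // nothing to cut
    }
    // Window of bytes X-9 .. X-1 (0-based) where a straddling "&" could sit.
    $start = max(0, $x - ($maxLen - 1));
    $window = substr($s, $start, $x - $start);
    $amp = strrpos($window, '&');
    if ($amp === false) {
        return substr($s, 0, $x);
    }
    $amp += $start; // position in the full string
    $semi = strpos($s, ';', $amp);
    // Extend only if the ";" closes the entity at or past the cut
    // and within the assumed maximum entity length.
    if ($semi !== false && $semi >= $x && $semi - $amp < $maxLen) {
        return substr($s, 0, $semi + 1);
    }
    return substr($s, 0, $x);
}

$s = 'This is my string with a special char : &egrave; - and more';
echo cut_keeping_entity($s, 45), "\n";
```

A cut at byte 45 lands inside &egrave;, so the sketch moves the cut to just after the semicolon; a cut that does not touch an entity is left where it is.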

Upvotes: 3

Tim S

Reputation: 5101

I think you would have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check that against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have now, we could give you a better answer.

Upvotes: 0
