Reputation: 2095
I need to truncate a string to a specified length while ignoring HTML tags. I found an appropriate function here.
I made minor changes to it and added output buffering with ob_start().
The problem is with UTF-8. If the last character of the truncated string is one of [ą,č,ę,ė,į,š,ų,ū,ž], I get the REPLACEMENT CHARACTER U+FFFD (�) at the end of the string.
Here is my code. You can copy and paste it to try it yourself:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>String truncate</title>
</head>
<?php
$html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>';
$html = html_truncate(27, $html);
echo $html;
/* Truncate HTML, close opened tags
 *
 * @param int, maxlength of the string
 * @param string, html
 * @return $html
 */
function html_truncate($maxLength, $html){
    $printedLength = 0;
    $position = 0;
    $tags = array();
    ob_start();
    while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){
        list($tag, $tagPosition) = $match[0];
        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength){
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }
        print($str);
        $printedLength += strlen($str);
        if ($tag[0] == '&'){
            // Handle the entity.
            print($tag);
            $printedLength++;
        }
        else{
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/'){
                // This is a closing tag.
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/'){
                // Self-closing tag.
                print($tag);
            }
            else{
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }
        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }
    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));
    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
    $bufferOuput = ob_get_contents();
    ob_end_clean();
    $html = $bufferOuput;
    return $html;
}
?>
<body>
</body>
</html>
The output of this function looks like this:
Koks nors tekstas.
Lietuvi�
Any ideas why this function messes up UTF-8?
Upvotes: 2
Views: 1035
Reputation: 55
Just use the following function:
echo utf8_encode($match[0]); // $match[0] is the variable you want to print
Upvotes: 0
Reputation: 197624
Any ideas why this function messes up UTF-8?
The general problem is that the function does not handle UTF-8 strings. It assumes a single-byte charset such as US-ASCII or Latin-1.
You want to make the function compatible with UTF-8, which is a multibyte charset.
To do that, you need to verify that each of the string functions used inside that function properly handles multibyte UTF-8 data:
preg_match needs a pattern with the u modifier (Docs) to work on UTF-8 strings.
substr needs to be replaced with mb_substr (Docs).
strlen needs to be replaced with mb_strlen (Docs).
As you're dealing with HTML, it's probably safer to use DOMDocument to manipulate the HTML chunk. That is just a side note, but it is much more flexible and works properly.
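The changes listed above could be combined roughly as follows. This is a minimal sketch, not a tested drop-in replacement: the name html_truncate_utf8 is made up for illustration, it builds a string instead of using output buffering, and it assumes the mbstring extension is available and the input is valid UTF-8. One detail to watch: the offsets returned with PREG_OFFSET_CAPTURE are byte offsets even with the u modifier, so the sketch keeps substr/strlen for slicing by position and uses mb_strlen/mb_substr only for counting and cutting visible text.

```php
<?php
// Sketch only: a UTF-8-aware variant of the question's html_truncate().
// Assumes mbstring is enabled and $html is valid UTF-8 (hypothetical helper name).
function html_truncate_utf8(int $maxLength, string $html): string
{
    $printedLength = 0; // visible characters emitted so far
    $position = 0;      // BYTE offset into $html
    $tags = [];         // stack of currently open tag names
    $out = '';
    // Same pattern as the original, plus the u modifier.
    $pattern = '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}u';

    while ($printedLength < $maxLength
        && preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, $position)) {
        list($tag, $tagPosition) = $match[0];

        // Slice by bytes (offsets are bytes), but measure in characters.
        $str = substr($html, $position, $tagPosition - $position);
        $strLength = mb_strlen($str, 'UTF-8');
        if ($printedLength + $strLength > $maxLength) {
            // Cut on a character boundary, never in the middle of a sequence.
            $out .= mb_substr($str, 0, $maxLength - $printedLength, 'UTF-8');
            $printedLength = $maxLength;
            break;
        }
        $out .= $str;
        $printedLength += $strLength;

        if ($tag[0] === '&') {
            // An entity counts as one visible character.
            $out .= $tag;
            $printedLength++;
        } elseif ($tag[1] === '/') {
            // Closing tag.
            array_pop($tags);
            $out .= $tag;
        } elseif (substr($tag, -2, 1) === '/') {
            // Self-closing tag.
            $out .= $tag;
        } else {
            // Opening tag.
            $out .= $tag;
            $tags[] = $match[1][0];
        }
        // Advance by the tag's byte length to stay consistent with byte offsets.
        $position = $tagPosition + strlen($tag);
    }

    // Append any remaining text, again cut by characters.
    if ($printedLength < $maxLength && $position < strlen($html)) {
        $rest = substr($html, $position);
        $out .= mb_substr($rest, 0, $maxLength - $printedLength, 'UTF-8');
    }
    // Close any tags that are still open.
    while (!empty($tags)) {
        $out .= sprintf('</%s>', array_pop($tags));
    }
    return $out;
}

echo html_truncate_utf8(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
// prints: <b>Koks nors tekstas</b>. <p>Lietuviš</p>
```

With the input from the question, the cut now falls after the complete "š" rather than after its first byte, so no replacement character appears.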
Upvotes: 1
Reputation: 23011
I would suggest simply using a Unicode-safe substring function such as mb_substr() to truncate Unicode strings.
So basically, try replacing all occurrences of substr() with mb_substr().
Before that, check that the mbstring PHP module is enabled in your environment.
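As a small illustration of the difference, the following sketch checks for mbstring and compares the byte-based and character-based functions on one word from the question. It assumes the script file itself is saved as UTF-8; the variable names are made up for the example.

```php
<?php
// Assumes this file is saved as UTF-8.
if (!extension_loaded('mbstring')) {
    exit("The mbstring extension is not enabled.\n");
}

$word = 'Lietuviškas';

// strlen() counts bytes: "š" occupies two bytes in UTF-8.
echo strlen($word), "\n";             // 12
echo mb_strlen($word, 'UTF-8'), "\n"; // 11

// substr() cuts after 8 bytes, which leaves only the first byte of "š".
$broken = substr($word, 0, 8);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// mb_substr() cuts after 8 characters and keeps "š" intact.
$clean = mb_substr($word, 0, 8, 'UTF-8');
echo $clean, "\n";                    // Lietuviš
```

The dangling byte left by substr() is what the browser renders as U+FFFD.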
Upvotes: 1