Reputation: 2105

PHP HTML truncate and UTF-8

I need to truncate a string to a specified length, ignoring HTML tags. I found an appropriate function here.

I made slight changes to it, adding output buffering with ob_start().

The problem is with UTF-8. If the last character of the truncated string is one of [ą,č,ę,ė,į,š,ų,ū,ž], I get the REPLACEMENT CHARACTER U+FFFD � at the end of the string.

Here is my code. You can copy-paste it and try it yourself:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>String truncate</title>
</head>

<?php   

    $html = '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>';

    $html = html_truncate(27, $html);

    echo $html;

    /* Truncate HTML, close opened tags
    *
    * @param int, maxlength of the string
    * @param string, html       
    * @return $html
    */  
    function html_truncate($maxLength, $html){

        $printedLength = 0;
        $position = 0;
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){

            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = substr($html, $position, $tagPosition - $position);
            if ($printedLength + strlen($str) > $maxLength){
                print(substr($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.

                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    print($tag);
                }
                else if ($tag[strlen($tag) - 2] == '/'){
                    // Self-closing tag.
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < strlen($html))
            print(substr($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
             printf('</%s>', array_pop($tags));


        $bufferOuput = ob_get_contents();

        ob_end_clean();         

        $html = $bufferOuput;   

        return $html;   

    }

?>

<body>
</body>
</html>

The result of this function looks like this:

Koks nors tekstas.
Lietuvi�

Any ideas why this function is messing up UTF-8?

Upvotes: 2

Answers (4)

user1058988

Reputation: 55

Just use the following function:

echo utf8_encode($match[0]); // $match[0] is the variable you want to print

Upvotes: 0

hakre

Reputation: 198217

Any ideas why this function is messing up UTF-8?

The general problem is that the function does not handle UTF-8 strings. It assumes a single-byte charset such as US-ASCII or Latin-1. substr and strlen count bytes, not characters, so a cut can land in the middle of a multibyte character and leave an invalid byte sequence behind.

You need to make the function compatible with UTF-8, which is a multibyte encoding.

To do that, verify that each of the string functions used inside the function handles UTF-8 properly:
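To see where the replacement character comes from, here is a small sketch using the word from the question. It assumes the mbstring extension is enabled and the file is saved as UTF-8. The variable names are only for illustration.

```php
<?php
// Assumption: mbstring is available and this source file is UTF-8 encoded.
$word = 'Lietuviškas';

echo strlen($word), "\n";             // counts bytes: 12
echo mb_strlen($word, 'UTF-8'), "\n"; // counts characters: 11

// "š" occupies two bytes in UTF-8. Cutting after 8 bytes keeps only its
// first byte, which is not valid UTF-8 on its own.
$byteCut = substr($word, 0, 8);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)

// Cutting after 8 characters keeps "š" whole.
$charCut = mb_substr($word, 0, 8, 'UTF-8');
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The browser renders the dangling first byte of "š" as U+FFFD, which matches the output shown in the question.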

preg_match needs a pattern with the u modifier to work on UTF-8 strings.
substr needs to be replaced with mb_substr.
strlen needs to be replaced with mb_strlen.
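Applied to the function from the question, the changes might look like the sketch below. The name html_truncate_utf8 is hypothetical, and the sketch builds a string instead of using output buffering. One caveat: PREG_OFFSET_CAPTURE and the offset argument of preg_match are byte offsets even with the u modifier, so slicing between tags still uses substr and strlen, while counting and cutting visible text use the mb_ functions. It assumes properly nested HTML and the mbstring extension.

```php
<?php
// Hypothetical UTF-8-aware variant of the question's function (a sketch).
function html_truncate_utf8(int $maxLength, string $html): string
{
    $printedLength = 0; // visible characters emitted so far
    $position = 0;      // byte offset into $html
    $tags = [];         // stack of currently open tag names
    $out = '';
    $pattern = '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}u';

    while ($printedLength < $maxLength
        && preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, $position)) {
        [$tag, $tagPosition] = $match[0];

        // Text before the tag: a byte slice is safe because both ends
        // sit on tag boundaries, never inside a character.
        $str = substr($html, $position, $tagPosition - $position);
        $strLength = mb_strlen($str, 'UTF-8');
        if ($printedLength + $strLength > $maxLength) {
            // Cut by characters, not bytes.
            $out .= mb_substr($str, 0, $maxLength - $printedLength, 'UTF-8');
            $printedLength = $maxLength;
            break;
        }
        $out .= $str;
        $printedLength += $strLength;

        if ($tag[0] === '&') {
            $out .= $tag;               // an entity counts as one character
            $printedLength++;
        } elseif ($tag[1] === '/') {
            array_pop($tags);           // closing tag
            $out .= $tag;
        } elseif (substr($tag, -2, 1) === '/') {
            $out .= $tag;               // self-closing tag
        } else {
            $out .= $tag;               // opening tag
            $tags[] = $match[1][0];
        }
        $position = $tagPosition + strlen($tag); // still a byte offset
    }

    // Remaining text after the last tag.
    if ($printedLength < $maxLength && $position < strlen($html)) {
        $rest = substr($html, $position);
        $out .= mb_substr($rest, 0, $maxLength - $printedLength, 'UTF-8');
    }

    // Close any tags that are still open.
    while ($tags) {
        $out .= '</' . array_pop($tags) . '>';
    }
    return $out;
}

echo html_truncate_utf8(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
```

With the input from the question, the cut now falls after the whole "š" rather than in the middle of it.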

As you're dealing with HTML, it's probably safer to use DOMDocument to manipulate the HTML chunk. That's just a note, but it's much more flexible and it works properly.
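A rough sketch of the DOMDocument route follows. The function names dom_truncate and dom_truncate_node are hypothetical, the wrapper div is only there so the fragment has a single known parent, and the XML encoding prefix is a commonly used workaround to make loadHTML read the input as UTF-8. It assumes the dom and mbstring extensions, and the exact serialization may vary with the libxml version.

```php
<?php
// Walk the tree in document order. Trim the text node where the character
// budget runs out and remove every node that comes after the cut.
function dom_truncate_node(DOMNode $node, int &$remaining): void
{
    // Copy the child list first because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
                $length = $remaining;
            }
            $remaining -= $length;
        } else {
            dom_truncate_node($child, $remaining);
        }
    }
}

function dom_truncate(int $maxLength, string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // Assumption: the encoding prefix makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first div in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $remaining = $maxLength;
    dom_truncate_node($wrapper, $remaining);

    // Serialize only the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo dom_truncate(27, '<b>Koks nors tekstas</b>. <p>Lietuviškas žodis.</p>');
```

Because the parser builds the tree, open tags are closed automatically on output, and entities in the input are counted as the single characters they decode to.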

Upvotes: 1