loeffel

Reputation: 485

Shorten text without splitting words or breaking html tags

I am trying to cut off text after 236 characters without cutting words in half, while preserving HTML tags. This is what I am using right now:

$shortdesc = $_helper->productAttribute($_product, $_product->getShortDescription(), 'short_description');
$length = 236;
echo substr($shortdesc, 0, strrpos(substr($shortdesc, 0, $length), " "));

While this works in most cases, it doesn't respect HTML tags. For example, this text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>

will get cut off with the tag still open. Is there any way to cut off text after 236 characters while respecting HTML tags?

Upvotes: 18

Views: 18629

Answers (7)

BennyA

Reputation: 1

This will work with Unicode (adapted from @nice ass's answer):

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = [];

    public static function trim($html, $maxLen = 25)
    {

        $dom = new \DOMDocument();
        $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        return $dom->saveHTML();
    }

    protected function walk(\DOMNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof \DOMText) {
                $this->totalLen += $nodeLen = mb_strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    $node->nodeValue = mb_substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

Upvotes: 0

nice ass

Reputation: 16709

This should do it:

class Html
{
    protected
        $reachedLimit = false,
        $totalLen = 0,
        $maxLen = 25,
        $toRemove = array();

    public static function trim($html, $maxLen = 25)
    {

        $dom = new DomDocument();

        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            $dom->loadHTML($html);
        } else {
            $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        }

        $instance = new static();
        $toRemove = $instance->walk($dom, $maxLen);

        // remove any nodes that exceed limit
        foreach ($toRemove as $child) {
            $child->parentNode->removeChild($child);
        }

        // remove wrapper tags added by DOMDocument (doctype, html...)
        if (version_compare(PHP_VERSION, '5.4.0') < 0) {
            // http://stackoverflow.com/a/6953808/1058140
            $dom->removeChild($dom->firstChild);
            $dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

            return $dom->saveHTML();
        }

        return $dom->saveHTML();
    }

    protected function walk(DomNode $node, $maxLen)
    {

        if ($this->reachedLimit) {
            $this->toRemove[] = $node;
        } else {
            // only text nodes should have text,
            // so do the splitting here
            if ($node instanceof DomText) {
                $this->totalLen += $nodeLen = strlen($node->nodeValue);

                // use mb_strlen / mb_substr for UTF-8 support
                if ($this->totalLen > $maxLen) {
                    $node->nodeValue = substr($node->nodeValue, 0, $nodeLen - ($this->totalLen - $maxLen)) . '...';
                    $this->reachedLimit = true;
                }
            }

            // if node has children, walk its child elements
            if (isset($node->childNodes)) {
                foreach ($node->childNodes as $child) {
                    $this->walk($child, $maxLen);
                }
            }
        }

        return $this->toRemove;
    }
}

Use like: $str = Html::trim($str, 236);

(demo here)
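
The class above cuts at an exact character count, so it can still split a word, which the question wanted to avoid. A minimal sketch of a helper that backs up to the last space is shown here; `cutAtWordBoundary` is a hypothetical name, not part of the answer's class, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: cut $text to at most $limit characters,
// backing up to the last space so that no word is split.
function cutAtWordBoundary(string $text, int $limit): string
{
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text; // already short enough
    }

    $cut = mb_substr($text, 0, $limit, 'UTF-8');

    // If the character right after the cut is a space, the cut
    // already ends on a whole word.
    if (mb_substr($text, $limit, 1, 'UTF-8') === ' ') {
        return $cut;
    }

    $lastSpace = mb_strrpos($cut, ' ', 0, 'UTF-8');

    // No space at all means one long word: fall back to a hard cut.
    return $lastSpace === false ? $cut : mb_substr($cut, 0, $lastSpace, 'UTF-8');
}

echo cutAtWordBoundary('Stet clita kasd gubergren', 20), "\n"; // prints "Stet clita kasd"
```

In `walk()`, the `substr(...)` call that shortens `$node->nodeValue` could probably be swapped for this helper, passing the same length expression as `$limit`.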


Some performance comparisons between this and CakePHP's regex solution:


There's very little difference, and at very large string sizes, DomDocument is actually faster. Reliability is more important than saving a few microseconds in my opinion.

Upvotes: 18

Brankodd

Reputation: 841

Here is a JS solution: trim-html

The idea is to split the HTML string into an array whose elements are either HTML tags (opening or closing) or plain text:

var arr = html.replace(/</g, "\n<")
              .replace(/>/g, ">\n")
              .replace(/\n\n/g, "\n")
              .replace(/^\n/g, "")
              .replace(/\n$/g, "")
              .split("\n");

Then we can iterate through the array and count characters.
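
Since the question is about PHP, here is a rough PHP sketch of the same split-and-count idea. `truncateTokens` is a hypothetical name, and the code assumes reasonably well-formed markup with no comments, scripts, or `>` characters inside attribute values. Entities such as `&amp;` are counted as several characters, and the cut can still land mid-word.

```php
<?php
// Sketch: split HTML into tag tokens and text tokens, count only the text
// characters, and close whatever tags are still open at the cut.
function truncateTokens(string $html, int $limit): string
{
    $voidTags = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $out = '';
    $count = 0;  // text characters emitted so far
    $open = [];  // open tag names, most recent first

    foreach ($tokens as $token) {
        if ($token[0] === '<') {
            if (preg_match('/^<\/(\w+)/', $token, $m)) {
                // closing tag: forget the most recently opened tag of that name
                $pos = array_search(strtolower($m[1]), $open, true);
                if ($pos !== false) {
                    array_splice($open, $pos, 1);
                }
            } elseif (preg_match('/^<(\w+)/', $token, $m)
                && !in_array(strtolower($m[1]), $voidTags, true)
                && substr($token, -2) !== '/>') {
                // opening tag that needs a closing counterpart
                array_unshift($open, strtolower($m[1]));
            }
            $out .= $token;
            continue;
        }

        $len = mb_strlen($token, 'UTF-8');
        if ($count + $len > $limit) {
            // this text token crosses the limit: keep only what still fits
            $out .= mb_substr($token, 0, $limit - $count, 'UTF-8');
            break;
        }
        $out .= $token;
        $count += $len;
        if ($count >= $limit) {
            break;
        }
    }

    // close the tags that are still open, innermost first
    foreach ($open as $tag) {
        $out .= "</$tag>";
    }
    return $out;
}

echo truncateTokens('Lorem <strong>ipsum dolor</strong> sit', 9), "\n";
// prints "Lorem <strong>ips</strong>"
```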

Upvotes: -2

Dilip Rajkumar

Reputation: 7074

I did this in JS; I hope the logic helps in PHP too.

splitText : function(content, count){
        var originalContent = content;
         content = content.substring(0, count);
          // If the cut lands inside an HTML tag (the last "<" has no matching ">" before the breaking point).
         if (content.lastIndexOf("<") > content.lastIndexOf(">")){
            content = content.substring(0, content.lastIndexOf('<'));
            count = content.length;
            if(originalContent.indexOf("</", count)!=-1){
                content += originalContent.substring(count, originalContent.indexOf('>', originalContent.indexOf("</", count))+1);
            }else{
                 content += originalContent.substring(count, originalContent.indexOf('>', count)+1);
            }
          //If the breaking point is in between tags.
         }else if(content.lastIndexOf("<") != content.lastIndexOf("</")){
            content = originalContent.substring(0, originalContent.indexOf('>', count)+1);
         }
        return content;
    },

Hope this logic helps someone.

Upvotes: -2

enenen

Reputation: 1967

function limitStrlen($input, $length, $ellipses = true, $strip_html = true, $skip_html = false)
{
    // strip tags, if desired
    if ($strip_html || !$skip_html) 
    {
        $input = strip_tags($input);

        // no need to trim, already shorter than trim length
        if (strlen($input) <= $length) 
        {
            return $input;
        }

        //find last space within length
        $last_space = strrpos(substr($input, 0, $length), ' ');
        if($last_space !== false) 
        {
            $trimmed_text = substr($input, 0, $last_space);
        } 
        else 
        {
            $trimmed_text = substr($input, 0, $length);
        }
    } 
    else 
    {
        if (strlen(strip_tags($input)) <= $length) 
        {
            return $input;
        }

        $trimmed_text = $input;

        $last_space = $length + 1;

        while(true)
        {
            $last_space = strrpos($trimmed_text, ' ');

            if($last_space !== false) 
            {
                $trimmed_text = substr($trimmed_text, 0, $last_space);

                if (strlen(strip_tags($trimmed_text)) <= $length) 
                {
                    break;
                }
            } 
            else 
            {
                $trimmed_text = substr($trimmed_text, 0, $length);
                break;
            }
        }

        // close unclosed tags.
        $doc = new DOMDocument();
        $doc->loadHTML($trimmed_text);
        $trimmed_text = $doc->saveHTML();
    }

    // add ellipses (...)
    if ($ellipses) 
    {
        $trimmed_text .= '...';
    }

    return $trimmed_text;
}

$str = "<h1><strong><span>Lorem</span></strong> <i>ipsum</i> <p class='some-class'>dolor</p> sit amet, consetetur.</h1>";

// view the HTML
echo htmlentities(limitStrlen($str, 22, false, false, true), ENT_COMPAT, 'UTF-8');

// view the result
echo limitStrlen($str, 22, false, false, true);

Note: There may be a better way to close tags than using DOMDocument. For example, a p tag inside an h1 tag will still render in a browser, but here the heading tag will be closed before the p tag, because strictly speaking a p tag is not allowed inside a heading. So be careful with HTML's strict nesting rules.

Upvotes: -1

fullybaked

Reputation: 4127

The best solution I have come across for this is from the CakePHP framework's TextHelper class.

Here is the method:

/**
* Truncates text.
*
* Cuts a string to the length of $length and replaces the last characters
* with the ending if the text is longer than length.
*
* ### Options:
*
* - `ending` Will be used as Ending and appended to the trimmed string
* - `exact` If false, $text will not be cut mid-word
* - `html` If true, HTML tags would be handled correctly
*
* @param string  $text String to truncate.
* @param integer $length Length of returned string, including ellipsis.
* @param array $options An array of html attributes and options.
* @return string Trimmed string.
* @access public
* @link http://book.cakephp.org/view/1469/Text#truncate-1625
*/
function truncate($text, $length = 100, $options = array()) {
    $default = array(
        'ending' => '...', 'exact' => true, 'html' => false
    );
    $options = array_merge($default, $options);
    extract($options);

    if ($html) {
        if (mb_strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        $totalLength = mb_strlen(strip_tags($ending));
        $openTags = array();
        $truncate = '';

        preg_match_all('/(<\/?([\w+]+)[^>]*>)?([^<>]*)/', $text, $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            if (!preg_match('/img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param/s', $tag[2])) {
                if (preg_match('/<[\w]+[^>]*>/s', $tag[0])) {
                    array_unshift($openTags, $tag[2]);
                } else if (preg_match('/<\/([\w]+)[^>]*>/s', $tag[0], $closeTag)) {
                    $pos = array_search($closeTag[1], $openTags);
                    if ($pos !== false) {
                        array_splice($openTags, $pos, 1);
                    }
                }
            }
            $truncate .= $tag[1];

            $contentLength = mb_strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $tag[3]));
            if ($contentLength + $totalLength > $length) {
                $left = $length - $totalLength;
                $entitiesLength = 0;
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $tag[3], $entities, PREG_OFFSET_CAPTURE)) {
                    foreach ($entities[0] as $entity) {
                        if ($entity[1] + 1 - $entitiesLength <= $left) {
                            $left--;
                            $entitiesLength += mb_strlen($entity[0]);
                        } else {
                            break;
                        }
                    }
                }

                $truncate .= mb_substr($tag[3], 0 , $left + $entitiesLength);
                break;
            } else {
                $truncate .= $tag[3];
                $totalLength += $contentLength;
            }
            if ($totalLength >= $length) {
                break;
            }
        }
    } else {
        if (mb_strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = mb_substr($text, 0, $length - mb_strlen($ending));
        }
    }
    if (!$exact) {
        $spacepos = mb_strrpos($truncate, ' ');
        if ($spacepos !== false) {
            if ($html) {
                $bits = mb_substr($truncate, $spacepos);
                preg_match_all('/<\/([a-z]+)>/', $bits, $droppedTags, PREG_SET_ORDER);
                if (!empty($droppedTags)) {
                    foreach ($droppedTags as $closingTag) {
                        if (!in_array($closingTag[1], $openTags)) {
                            array_unshift($openTags, $closingTag[1]);
                        }
                    }
                }
            }
            $truncate = mb_substr($truncate, 0, $spacepos);
        }
    }
    $truncate .= $ending;

    if ($html) {
        foreach ($openTags as $tag) {
            $truncate .= '</'.$tag.'>';
        }
    }

    return $truncate;
}

Other frameworks may have similar (or different) solutions to this problem, so you could take a look at them too. My familiarity with Cake is what prompted me to link to their solution.

Edit:

I just tested this method in an app I'm working on, using the OP's text:

<?php 
echo truncate(
'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong>', 
236, 
array('html' => true, 'ending' => '')); 
?>

Output:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. <strong>Stet clita kasd gubegre</strong>

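Notice that the output stops just short of completing the last word, because `exact` defaults to true; according to the docblock above, passing 'exact' => false avoids cutting mid-word. Either way, the strong tags are complete.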

Upvotes: 19

Phoenix

Reputation: 753

Can I just share a thought?

Sample text:

Lorem ipsum dolor sit amet, <i class="red">magna aliquyam erat</i>, duo dolores et ea rebum. <strong>Stet clita kasd gubergren</strong> hello

First, parse it into:

array(
    '0' => array(
        'tag' => '',
        'text' => 'Lorem ipsum dolor sit amet, '
    ),
    '1' => array(
        'tag' => '<i class="red">',
        'text' => 'magna aliquyam erat',
    ),
    '2' => ......
    '3' => ......
)

Then cut the text of each element in turn, wrap each shortened piece in its tag again, and join the pieces.

Upvotes: 1
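
A minimal PHP sketch of this segment idea might look like the following. The function names and the extra `close` key are assumptions added for illustration, and the regex only handles flat markup where tags are not nested and every opened tag is closed. Nested HTML would need a real parser.

```php
<?php
// Sketch: parse flat HTML into segments of (opening tag, closing tag, text).
function parseSegments(string $html): array
{
    // Either one complete <tag ...>text</tag> element, or a run of plain text.
    preg_match_all('/<(\w+)([^>]*)>([^<]*)<\/\1>|([^<]+)/', $html, $matches, PREG_SET_ORDER);

    $segments = [];
    foreach ($matches as $m) {
        if (isset($m[4]) && $m[4] !== '') {
            // plain text outside any tag
            $segments[] = ['tag' => '', 'close' => '', 'text' => $m[4]];
        } else {
            $segments[] = ['tag' => "<{$m[1]}{$m[2]}>", 'close' => "</{$m[1]}>", 'text' => $m[3]];
        }
    }
    return $segments;
}

// Cut the text of each segment in turn, re-wrap it in its tag, and join.
function truncateSegments(string $html, int $limit): string
{
    $out = '';
    $remaining = $limit; // text characters still allowed

    foreach (parseSegments($html) as $seg) {
        if ($remaining <= 0) {
            break;
        }
        $text = mb_substr($seg['text'], 0, $remaining, 'UTF-8');
        $remaining -= mb_strlen($text, 'UTF-8');
        $out .= $seg['tag'] . $text . $seg['close'];
    }
    return $out;
}

$sample = 'Lorem ipsum dolor sit amet, <i class="red">magna aliquyam erat</i>, duo dolores et ea rebum. '
    . '<strong>Stet clita kasd gubergren</strong> hello';

echo truncateSegments($sample, 33), "\n";
// prints: Lorem ipsum dolor sit amet, <i class="red">magna</i>
```

Because each segment carries its own opening and closing tag, a cut inside a segment never leaves a tag open.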
