Reputation: 2625

How to split long HTML content into multiple divs without breaking words or formatting in PHP

So far I have:

public static function splitContent($string, $lenght,  $maxCols){
        if (strlen($string)<($lenght*$maxCols) && strlen($string)> $lenght){
            $string = wordwrap($string, $lenght, "||"); //assume your string doesn't contain `||`
            $parts = explode("||", $string);
            $result='';
            foreach ($parts as $part){
                $result=$result.'<div>'.$part.'</div>';
            }
            return $result;
        }
        return $string;
    }

It works well when it comes to not breaking words, but it often splits HTML tags, producing output like <span </div><div> style=....>. How can I prevent that? I see there are many problems like this when splitting an HTML-formatted string. Does anyone know of a library that does this without hassle? It would be great if it counted only visible characters.

Upvotes: 3

Answers (2)

Zoltán Süle

Reputation: 1694

I had to split any random HTML text into 2 equal parts to display them in 2 columns next to each other.

The logic below splits the HTML into 2 parts, taking into account the word boundaries and the HTML tags. With a bit more effort you can extend it to split the HTML into multiple divs.

I have used @jave.web's logic to close the unclosed HTML tags.

// splitHtmlTextIntoTwoEqualColumnsTrait.php
<?php

/**
 * TCPDF doesn't support two-column text where the length of the text is limited and both columns have equal height.
 *
 * This trait calculates the middle of the text, splits it into 2 parts and returns them.
 * It keeps the word boundaries and takes care of the HTML tags too! There is no broken HTML tag after the split.
 */
trait splitHtmlTextIntoTwoEqualColumnsTrait
{
    protected function splitHtmlTextIntoTwoEqualColumns(string $htmlText): array
    {
        // removes unnecessary characters and HTML tags
        $htmlText = str_replace("\xc2\xa0", ' ', $htmlText);
        $htmlText = html_entity_decode($htmlText);
        $pureText = $this->getPureText($htmlText);

        // calculates the length of the text
        $fullLength = strlen($pureText);
        $halfLength = (int) ceil($fullLength / 2); // ceil() returns a float; the helpers expect an int

        $words = explode(' ', $pureText);

        // finds the word which is in the middle of the text
        $middleWordPosition = $this->getPositionOfMiddleWord($words, $halfLength);

        // iterates through the HTML and split the text into 2 parts when it reaches the middle word.
        $columns = $this->splitHtmlStringInto2Strings($htmlText, $middleWordPosition);

        return $this->closeUnclosedHtmlTags($columns, $halfLength * 2);
    }

    private function getPureText(string $htmlText): string
    {
        $pureText = strip_tags($htmlText);
        $pureText = preg_replace('/[\x00-\x1F\x7F]/', '', $pureText);

        return str_replace(["\r\n", "\r", "\n"], ['', '', ''], $pureText);
    }

    /**
     * finds the word which is in the middle of the text
     */
    private function getPositionOfMiddleWord(array $words, int $halfLength): int
    {
        $wordPosition = 0;
        $stringLength = 0;
        for ($p = 0; $p < count($words); $p++) {
            $stringLength += mb_strlen($words[$p], 'UTF-8') + 1;
            if ($stringLength > $halfLength) {
                $wordPosition = $p;
                break;
            }
        }

        return $wordPosition;
    }

    /**
     * iterates through the HTML and split the text into 2 parts when it reaches the middle word.
     */
    private function splitHtmlStringInto2Strings(string $htmlText, int $wordPosition): array
    {
        $columns = [
            1 => '',
            2 => '',
        ];

        $columnId    = 1;
        $wordCounter = 0;
        $inHtmlTag   = false;
        for ($s = 0; $s <= strlen($htmlText) - 1; $s++) {
            if ($inHtmlTag === false && $htmlText[$s] === '<') {
                $inHtmlTag = true;
            }

            if ($inHtmlTag === true) {
                $columns[$columnId] .= $htmlText[$s];
                if ($htmlText[$s] === '>') {
                    $inHtmlTag = false;
                }
            } else {
                if ($htmlText[$s] === ' ') {
                    $wordCounter++;
                }
                if ($wordCounter > $wordPosition && $columnId < 2) {
                    $columnId++;
                    $wordCounter = 0;
                }

                $columns[$columnId] .= $htmlText[$s];
            }
        }

        return array_map('trim', $columns);
    }

    private function closeUnclosedHtmlTags(array $columns, int $maxLength): array
    {
        $column1      = $columns[1];
        $unclosedTags = $this->getUnclosedHtmlTags($columns[1], $maxLength);
        foreach (array_reverse($unclosedTags) as $tag) {
            $column1 .= '</' . $tag . '>';
        }

        $column2 = '';
        foreach ($unclosedTags as $tag) {
            $column2 .= '<' . $tag . '>';
        }
        $column2 .= $columns[2];

        return [$column1, $column2];
    }

    /**
     * https://stackoverflow.com/a/26175271/5356216
     */
    private function getUnclosedHtmlTags(string $html, int $maxLength = 250): array
    {
        $htmlLength = strlen($html);
        $unclosed   = [];
        $counter    = 0;
        $i          = 0;
        while (($i < $htmlLength) && ($counter < $maxLength)) {
            if ($html[$i] == "<") {
                $currentTag = "";
                $i++;
                if (($i < $htmlLength) && ($html[$i] != "/")) {
                    while (($i < $htmlLength) && ($html[$i] != ">") && ($html[$i] != "/")) {
                        $currentTag .= $html[$i];
                        $i++;
                    }
                    if (($i < $htmlLength) && ($html[$i] == "/")) { // bounds check: the tag may be cut off at the end of the string
                        do {
                            $i++;
                        } while (($i < $htmlLength) && ($html[$i] != ">"));
                    } else {
                        $currentTag = explode(" ", $currentTag);
                        $unclosed[] = $currentTag[0];
                    }
                } elseif (($i < $htmlLength) && ($html[$i] == "/")) {
                    array_pop($unclosed);
                    do {
                        $i++;
                    } while (($i < $htmlLength) && ($html[$i] != ">"));
                }
            } else {
                $counter++;
            }
            $i++;
        }

        return $unclosed;
    }

}

How to use it:

// yourClass.php
<?php
declare(strict_types=1);

class yourClass
{
    use splitHtmlTextIntoTwoEqualColumnsTrait;

    public function do()
    {
        // your logic
        $htmlString = '';
        [$column1, $column2] = $this->splitHtmlTextIntoTwoEqualColumns($htmlString);
    }

}

Upvotes: 1

jave.web

Reputation: 15032

As far as I know, this cannot be achieved by simple string splitting because, as you have already found out, there is a very high chance of breaking the HTML.

However you could:

1) Load the HTML string char by char and track tags' structure

as mentioned in this answer: Split a html code in two equal content parts, in PHP or JS

2) Load the HTML as an object and count elements' text nodes

2.1) For loading you could use

DOM - http://php.net/manual/en/book.dom.php
SimpleXML - http://php.net/manual/en/book.simplexml.php
There are many more PHP libraries that can load HTML

2.2) Go through loaded elements and count their text nodes

Use an algorithm that goes through the code
Count text nodes until the count is the desired length
After that clean all text nodes that would be next in display
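
The steps above could be sketched roughly as follows, assuming the DOM and mbstring extensions are available. truncateHtmlByTextNodes() is a hypothetical helper name, not part of any library. It counts the characters of the text nodes in document order, trims the node where the limit is reached and removes every text node after it. Whitespace between elements is counted too, and elements that end up empty are left in the output.

```php
<?php
// Hypothetical helper: keeps the first $limit text characters of an HTML fragment.
function truncateHtmlByTextNodes(string $html, int $limit): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about imperfect markup
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8;
    // the wrapper <div> gives the fragment a single known root element.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $root      = $doc->getElementsByTagName('div')->item(0);
    $xpath     = new DOMXPath($doc);
    $remaining = $limit;
    $toRemove  = [];

    // XPath returns the text nodes in document order, i.e. display order.
    foreach ($xpath->query('.//text()', $root) as $text) {
        $len = mb_strlen($text->data);
        if ($remaining <= 0) {
            $toRemove[] = $text;                 // past the limit: drop later
        } elseif ($len > $remaining) {
            $text->data = mb_substr($text->data, 0, $remaining);
            $remaining  = 0;                     // limit reached inside this node
        } else {
            $remaining -= $len;
        }
    }
    foreach ($toRemove as $text) {
        $text->parentNode->removeChild($text);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncateHtmlByTextNodes('<p>Hello <b>big</b> world</p>', 8);
```

Because the parser builds the tree, the tags are always closed properly; the price is that the parser may normalize the markup slightly.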

As for visible characters: PHP itself doesn't know what CSS applies to your elements, but if you load the HTML as an object you could, for example, call getAttribute('style') and search it for your "hide" CSS :)
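
A small sketch of that idea, assuming hidden content is marked with an inline display:none or visibility:hidden. Both function names are hypothetical, and rules coming from stylesheets or classes are not detected at all.

```php
<?php
// Hypothetical helper: true if the element's inline style hides it.
function isHiddenByInlineStyle(DOMElement $el): bool
{
    $style = strtolower($el->getAttribute('style'));
    $regex = '/(?:^|;)\s*(?:display\s*:\s*none|visibility\s*:\s*hidden)\b/';
    return preg_match($regex, $style) === 1;
}

// Hypothetical helper: counts text characters that are not inside a hidden element.
function countVisibleCharacters(string $html): int
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about imperfect markup
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $count = 0;
    foreach ($xpath->query('//text()') as $text) {
        // Walk up the ancestors; skip the text node if any of them is hidden.
        for ($node = $text->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
            if (isHiddenByInlineStyle($node)) {
                continue 2;
            }
        }
        $count += mb_strlen($text->data);
    }
    return $count;
}

echo countVisibleCharacters('abc<span style="display: none">hidden</span>de'); // prints 5
```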

Note: both approaches 1) and 2) cost some performance, so if you are applying this on a higher-traffic site you should consider some kind of caching for the results.
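
One possible shape for such caching, as a rough sketch only: cachedSplit() and $cacheDir are hypothetical, and $splitter stands for whichever splitting function you end up using. The result is stored in a file keyed by a hash of the input, so the expensive parsing runs once per distinct HTML string and length.

```php
<?php
// Hypothetical helper: file-based cache around any splitter callable.
function cachedSplit(string $html, int $length, callable $splitter, string $cacheDir): string
{
    // The key depends on both the content and the requested length.
    $file = $cacheDir . '/' . hash('sha256', $length . ':' . $html) . '.html';

    if (is_file($file)) {
        return file_get_contents($file);   // cache hit: no parsing needed
    }

    $result = $splitter($html, $length);   // cache miss: do the expensive work
    file_put_contents($file, $result, LOCK_EX);

    return $result;
}
```

On a real site you would probably swap the files for whatever cache your framework already provides and decide how stale entries get removed.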

EDIT: ad 1)

I've created an example function showing how to track open tags.

NOTE: this function assumes XHTML! It expects self-closing tags such as <img> to be written as <img />. Also note that I wrote this quickly, so it may not be the best or most efficient way to do it :)

You can see it work at http://ideone.com/erSDlg

//PHP
function closeTags( &$html, $length = 20 ){
    $htmlLength = strlen($html);
    $unclosed = array();
    $counter = 0;
    $i=0;
    while( ($i<$htmlLength) && ($counter<$length) ){
        if( $html[$i]=="<" ){
            $currentTag = "";
            $i++;
            if( ($i<$htmlLength) && ($html[$i]!="/") ){
                while( ($i<$htmlLength) && ($html[$i]!=">") && ($html[$i]!="/") ){
                    $currentTag .= $html[$i];
                    $i++;
                }
                if( ($i<$htmlLength) && ($html[$i] == "/") ){ // bounds check: the tag may be cut off at the end of the string
                    do{ $i++; } while( ($i<$htmlLength) && ($html[$i]!=">") );  
                } else {
                    $currentTag = explode(" ", $currentTag);
                    $unclosed[] = $currentTag[0];
                }
            } elseif( ($i<$htmlLength) && ($html[$i]=="/") ){
                array_pop($unclosed);
                do{ $i++; } while( ($i<$htmlLength) && ($html[$i]!=">") );
            }
        } else{
            $counter++; 
        }
        $i++;
    }
    // $i already points one past the last processed character, so cut at $i
    $result = substr($html, 0, $i);
    $unclosed = array_reverse( $unclosed );
    foreach( $unclosed as $tag ) $result .= '</'.$tag.'>';
    print_r($result);
}

$html = "<div>123890<span>1234<img src='i.png' /></span>567890<div><div style='test' class='nice'>asfaasf";
closeTags( $html, 20 );
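
Building on the same tag-tracking idea, the multi-div case from the question might be sketched like this. splitHtmlIntoChunks() is a hypothetical helper, not a library function. It assumes well-formed markup with no ">" inside attribute values or comments, counts only characters outside tags (an entity such as &amp;amp; counts as several characters), breaks only between words, and closes the open tags at the end of each chunk before reopening them, attributes included, at the start of the next one.

```php
<?php
// Sketch only: splits HTML into chunks of at most ~$length visible characters.
function splitHtmlIntoChunks(string $html, int $length): array
{
    $void  = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
              'link', 'meta', 'source', 'track', 'wbr'];
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

    // Builds the closing tags for everything currently open, innermost first.
    $closers = static function (array $open): string {
        $out = '';
        foreach (array_reverse($open) as $o) {
            $out .= '</' . $o['name'] . '>';
        }
        return $out;
    };

    $chunks = [];
    $open   = [];  // stack of ['name' => tag name, 'tag' => full opening tag]
    $chunk  = '';
    $count  = 0;   // visible characters in the current chunk

    foreach (preg_split('/(<[^>]*>)/', $html, -1, $flags) as $token) {
        if ($token[0] === '<') {
            // Tags are copied as-is and never counted; only the stack is updated.
            $chunk .= $token;
            if (preg_match('~^<(/?)([a-z][a-z0-9]*)~i', $token, $m)) {
                $name = strtolower($m[2]);
                if ($m[1] === '/') {
                    array_pop($open);
                } elseif (substr($token, -2) !== '/>' && !in_array($name, $void, true)) {
                    $open[] = ['name' => $name, 'tag' => $token];
                }
            }
            continue;
        }
        // Text: walk word by word so a break never lands inside a word.
        foreach (preg_split('/(\s+)/', $token, -1, $flags) as $piece) {
            $len = mb_strlen($piece);
            if ($count > 0 && trim($piece) !== '' && $count + $len > $length) {
                $chunks[] = $chunk . $closers($open);
                $chunk    = implode('', array_column($open, 'tag')); // reopen tags
                $count    = 0;
            }
            $chunk .= $piece;
            $count += $len;
        }
    }
    if ($chunk !== '') {
        $chunks[] = $chunk . $closers($open);
    }
    return $chunks;
}

$parts = splitHtmlIntoChunks('<p>Hello <b>big wide</b> world</p>', 10);
// $parts is ['<p>Hello <b>big </b></p>', '<p><b>wide</b> world</p>']
echo '<div>' . implode('</div><div>', $parts) . '</div>';
```

A single word longer than $length is kept whole rather than broken, and a break that falls right after an opening tag leaves an empty element such as <b></b> at the end of a chunk.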

Upvotes: 2