Ahmad Fouad
Ahmad Fouad

Reputation: 4107

Close open HTML tags in a string

Situation is a string that results in something like this:

<p>This is some text and here is a <strong>bold text then the post stop here....</p>

Because the function returns a teaser (summary) of the text, it stops after certain words. Where in this case the tag strong is not closed. But the whole string is wrapped in a paragraph.

Is it possible to convert the above result/output to the following:

<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>

I do not know where to begin. The problem is that.. I found a function on the web which does it regex, but it puts the closing tag after the string.. therefore it won't validate because I want all open/close tags within the paragraph tags. The function I found does this which is wrong also:

<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>

I want to know that the tag can be strong, italic, anything. That's why I cannot append the function and close it manually in the function. Any pattern that can do it for me?

Upvotes: 24

Views: 31967

Answers (10)

V.Volkov
V.Volkov

Reputation: 739

An up-to-date solution with parsing HTML would be:

function fix_html($html) {
    $dom = new DOMDocument();
    $dom->loadHTML( mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
    return $dom->saveHTML();
}

LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD is needed to avoid implementing doctype, html and body.. the rest looks pretty obvious :)

UPDATE: After some testing noticed, that the solution above ruins a correct layout time-after-time. The following works well, though:

function fix_html($html) {
    $dom = new DOMDocument();
    $dom->loadHTML( mb_convert_encoding( $html, 'HTML-ENTITIES', 'UTF-8' ) );
    $return = '';
    foreach ( $dom->getElementsByTagName( 'body' )->item(0)->childNodes as $v ) {
        $return .= $dom->saveHTML( $v );
    }
    return $return;
}

Upvotes: 0

Karan Patel
Karan Patel

Reputation: 19

This is works for me to close any open HTML tags in a script.

<?php
function closetags($html) {
preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
$openedtags = $result[1];
preg_match_all('#</([a-z]+)>#iU', $html, $result);
$closedtags = $result[1];
$len_opened = count($openedtags);
if (count($closedtags) == $len_opened) {
    return $html;
}
$openedtags = array_reverse($openedtags);
for ($i=0; $i < $len_opened; $i++) {
    if (!in_array($openedtags[$i], $closedtags)) {
        $html .= '</'.$openedtags[$i].'>';
    } else {
        unset($closedtags[array_search($openedtags[$i], $closedtags)]);
    }
}
return $html;
}

Upvotes: 0

Pavel Ř&#237;ha
Pavel Ř&#237;ha

Reputation: 51

And what about using PHP's native DOMDocument class? It inherently parses HTML and corrects syntax errors... E.g.:

$fragment = "<article><h3>Title</h3><p>Unclosed";
$doc = new DOMDocument();
$doc->loadHTML($fragment);
$correctFragment = $doc->getElementsByTagName('body')->item(0)->C14N();
echo $correctFragment;

However, there are several disadvantages of this approach. Firstly, it wraps the original fragment within the <body> tag. You can get rid of it easily by something like (preg_)replace() or by substituting the ...->C14N() function by some custom innerHTML() function, as suggested for example at http://php.net/manual/en/book.dom.php#89718. The second pitfall is that PHP throws an 'invalid tag in Entity' warning if HTML5 or custom tags are used (nevertheless, it will still proceed correctly).

Upvotes: 5

Luca C.
Luca C.

Reputation: 12594

if tidy module is installed, use php tidy extension:

tidy_repair_string($html)

reference

Upvotes: 1

Andrew
Andrew

Reputation: 1150

This PHP method always worked for me. It will close all un-closed HTML tags.

function closetags($html) {
    preg_match_all('#<([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];

    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)){
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}

Upvotes: 4

Poupoudoum
Poupoudoum

Reputation: 31

I've done this code witch doest the job quite correctly...

It's old school but efficient and I've added a flag to remove the unfinished tags such as " blah blah http://stackoverfl"

public function getOpennedTags(&$string, $removeInclompleteTagEndTagIfExists = true) {

    $tags = array();
    $tagOpened = false;
    $tagName = '';
    $tagNameLogged = false;
    $closingTag = false;

    foreach (str_split($string) as $c) {
        if ($tagOpened && $c == '>') {
            $tagOpened = false;
            if ($closingTag) {
                array_pop($tags);
                $closingTag = false;
                $tagName = '';
            }
            if ($tagName) {
                array_push($tags, $tagName);
            }
        }
        if ($tagOpened && $c == ' ') {
            $tagNameLogged = true;
        }
        if ($tagOpened && $c == '/') {
            if ($tagName) {
                //orphan tag
                $tagOpened = false;
                $tagName = '';
            } else {
                //closingTag
                $closingTag = true;
            }
        }
        if ($tagOpened && !$tagNameLogged) {
            $tagName .= $c;
        }
        if (!$tagOpened && $c == '<') {
            $tagNameLogged = false;
            $tagName = '';
            $tagOpened = true;
            $closingTag = false;
        }
    }

    if ($removeInclompleteTagEndTagIfExists && $tagOpened) {
        // an tag has been cut for exemaple ' blabh blah <a href="sdfoefzofk' so closing the tag will not help...
        // let's remove this ugly piece of tag
        $pos = strrpos($string, '<');
        $string = substr($string, 0, $pos);
    }

    return $tags;
}

Usage example :

$tagsToClose = $stringHelper->getOpennedTags($val);
$tagsToClose = array_reverse($tagsToClose);

foreach ($tagsToClose as $tag) {
    $val .= "</$tag>";
}

Upvotes: 0

Markus
Markus

Reputation: 1557

A small modification to the original answer...while the original answer stripped tags correctly. I found that during my truncation, I could end up with chopped up tags. For example:

This text has some <b>in it</b>

Truncating at character 21 results in:

This text has some <

The following code, builds on the next best answer and fixes this.

function truncateHTML($html, $length)
{
    $truncatedText = substr($html, $length);
    $pos = strpos($truncatedText, ">");
    if($pos !== false)
    {
        $html = substr($html, 0,$length + $pos + 1);
    }
    else
    {
        $html = substr($html, 0,$length);
    }

    preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];

    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];

    $len_opened = count($openedtags);

    if (count($closedtags) == $len_opened)
    {
        return $html;
    }

    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++)
    {
        if (!in_array($openedtags[$i], $closedtags))
        {
            $html .= '</'.$openedtags[$i].'>';
        }
        else
        {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }


    return $html;
}


$str = "This text has <b>bold</b> in it</b>";
print "Test 1 - Truncate with no tag: " . truncateHTML($str, 5) . "<br>\n";
print "Test 2 - Truncate at start of tag: " . truncateHTML($str, 20) . "<br>\n";
print "Test 3 - Truncate in the middle of a tag: " . truncateHTML($str, 16) . "<br>\n";
print "Test 4: - Truncate with less text: " . truncateHTML($str, 300) . "<br>\n";

Hope it helps someone out there.

Upvotes: 10

Russell Dias
Russell Dias

Reputation: 73402

There are numerous other variables that need to be addressed to give a full solution, but are not covered by your question.

However, I would suggest using something like HTML Tidy and in particular the repairFile or repaireString methods.

Upvotes: 3

JoshD
JoshD

Reputation: 12824

Using a regular expression isn't an ideal approach for this. You should use an html parser instead to create a valid document object model.

As a second option, depending on what you want, you could use a regex to remove any and all html tags from your string before you put it in the <p> tag.

Upvotes: 0

alexn
alexn

Reputation: 59022

Here is a function i've used before, which works pretty well:

function closetags($html) {
    preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
} 

Personally though, I would not do it using regexp but a library such as Tidy. This would be something like the following:

$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
    'output-xml' => true,
    'input-xml' => true
));
echo $clean;

Upvotes: 48

Related Questions