Pamela

Reputation: 517

PHP preg_split Input by <br>, <br/>, <p> into Separate Paragraphs

I am fetching a page with cURL whose markup is badly malformed. There is a particular snippet of the page I am trying to parse into paragraphs. The snippet may be divided by <p> and </p> tags, or separated by one or more <br> or <br/> tags. Where two <br> tags appear one after another, I don't want them to produce two separate paragraphs.

The code I'm currently using to parse and display it is:

$paragraphs = preg_split('/(<\s*p\s*\/?>)|(<\s*br\s*\/?>)|(\s\s+)|(<\s*\/p\s*\/?>)/', $article, -1, PREG_SPLIT_NO_EMPTY);
$paragraphcount = count($paragraphs);
for ($x = 1; $x <= $paragraphcount; $x++) {
    echo "<p>" . $paragraphs[$x - 1] . "</p>";
}

However, this does not work as expected. An example input and the output it produces:

Input 1: first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>

Output 1: <p>first part </p><p> </p><p>second part </p><p> </p><p> third part </p><p> </p><p>fourth part</p><p> </p>

My code is parsing the input into paragraphs; however, it's also adding extra paragraphs containing only a space.

Any help would be appreciated.

The input is UTF-8, if that makes a difference.

Upvotes: 1

Views: 1162

Answers (2)

trincot

Reputation: 350365

Here is a solution with preg_replace:

$article = "first part </p> <p> second part </p> <p> third part </p> 
            <p> fourth part <br/> <br> fifth part";
$healed = substr(
          preg_replace('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', "</p><p>", "<p>$article<p>"),
          4, -3);

It first wraps the string in <p> tags, then replaces each run of break variants (<p>, </p>, <br> or <br/>, together with any surrounding whitespace) with a single </p><p>, and finally strips the leading </p> and the trailing <p> with substr. Note that this produces the final string directly, without an intermediate array.

echo $healed;

outputs:

<p>first part</p><p>second part</p><p>third part</p><p>fourth part</p><p>fifth part</p>

Note that the u modifier at the end of the regular expression is needed so that the pattern and the subject are treated as UTF-8.
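The three steps described above can also be traced one at a time. The following is only a minimal sketch on a shortened sample input; the intermediate variable names $wrapped and $replaced are introduced here for illustration and are not part of the original answer.

```php
<?php
// Shortened sample input in the same style as the question (an assumption for this sketch).
$article = "first part </p> <p> second part <br/> <br> third part";

// Step 1: wrap the input so that it both starts and ends with a break tag.
$wrapped = "<p>$article<p>";

// Step 2: collapse every run of <p>, </p>, <br> and <br/> tags, together with
// the whitespace around them, into a single "</p><p>" boundary.
$replaced = preg_replace('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', "</p><p>", $wrapped);
// $replaced now begins with "</p><p>" and ends with "</p><p>".

// Step 3: drop the stray leading "</p>" (4 characters) and trailing "<p>" (3 characters).
$healed = substr($replaced, 4, -3);

echo $healed, "\n";
// <p>first part</p><p>second part</p><p>third part</p>
```

Wrapping first is what guarantees that the result always starts with </p> and ends with <p>, so the fixed offsets passed to substr are safe regardless of how the input begins or ends.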

If on the other hand you need the paragraphs in an array, then preg_split is better suited (using the same regular expression):

$paragraphs = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u',
                         $article, -1, PREG_SPLIT_NO_EMPTY);

If you then write:

foreach ($paragraphs as $paragraph) {
    echo "$paragraph\n";
}

You get:

first part
second part
third part
fourth part
fifth part
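If the goal is still the <p>-wrapped output from the question, the array can be wrapped again after splitting. A minimal sketch, assuming Input 1 from the question as the sample and a hypothetical $html accumulator; it does not escape the pieces, on the assumption that they may legitimately contain other inline markup.

```php
<?php
// Input 1 from the question.
$article = "first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>";

// Split on any run of <p>, </p>, <br> or <br/> tags plus surrounding whitespace.
$paragraphs = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', $article, -1, PREG_SPLIT_NO_EMPTY);

// Wrap each piece in its own paragraph; foreach avoids manual index arithmetic.
$html = '';
foreach ($paragraphs as $paragraph) {
    $html .= "<p>$paragraph</p>";
}

echo $html, "\n";
// <p>first part</p><p>second part</p><p>third part</p><p>fourth part</p>
```

Because the whitespace around each tag is part of the delimiter, no space-only pieces are left over, and PREG_SPLIT_NO_EMPTY removes the empty string that would otherwise follow the trailing <br/>.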

Upvotes: 2

Zahur Sh

Reputation: 131

print_r(preg_split('/((<\s*p\s*\/?>\s*)|(<\s*br\s*\/?>\s*)|(\s\s+)|(<\s*\/p\s*\/?>\s*))+/', $article, -1, PREG_SPLIT_NO_EMPTY));

result:

Array
(
    [0] => first part 
    [1] => second part 
    [2] => third part 
    [3] => fourth part 
)
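Each element in this result keeps a trailing space, because the single space in front of each tag is not matched as part of the delimiter. If that matters, the pieces can be trimmed afterward. A minimal sketch using the same pattern on Input 1 from the question; the names $pattern and $parts are introduced here for illustration only.

```php
<?php
// Input 1 from the question.
$article = "first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>";

$pattern = '/((<\s*p\s*\/?>\s*)|(<\s*br\s*\/?>\s*)|(\s\s+)|(<\s*\/p\s*\/?>\s*))+/';
$parts = preg_split($pattern, $article, -1, PREG_SPLIT_NO_EMPTY);

// Trim each piece, drop any piece that is now empty, and reindex from 0.
$paragraphs = array_values(array_filter(array_map('trim', $parts), 'strlen'));

print_r($paragraphs);
```

trim only strips ASCII whitespace bytes, which never occur inside a multibyte UTF-8 sequence, so this step should be safe for UTF-8 input.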

Upvotes: 2
