Reputation: 517
I am fetching a page with curl, and its markup is very ill-formed. There is a particular snippet of the page I am trying to parse into paragraphs. This input snippet may be divided by <p> and </p> tags, or separated by one or more <br> or <br/> tags. Where two <br> tags follow one another, I don't want them to produce two separate paragraphs.
The code I am currently using to parse and display the snippet is:
$paragraphs = preg_split('/(<\s*p\s*\/?>)|(<\s*br\s*\/?>)|(\s\s+)|(<\s*\/p\s*\/?>)/', $article, -1, PREG_SPLIT_NO_EMPTY);
$paragraphcount = count($paragraphs);
for ($x = 1; $x <= $paragraphcount; $x++) {
    echo "<p>" . $paragraphs[$x - 1] . "</p>";
}
However, this is not working as expected. An example input and the output it produces:
Input 1: first part </p> <p> second part </p> <p> third part </p> <p> fourth part <br/>
Output 1: <p>first part </p><p> </p><p>second part </p><p> </p><p> third part </p><p> </p><p>fourth part</p><p> </p>
My code is parsing the input into paragraphs; however, it's also adding extra paragraphs containing only a space.
Any help would be appreciated.
Input is UTF-8 if it makes a difference.
Upvotes: 1
Views: 1162
Reputation: 350365
Here is a solution with preg_replace:
$article = "first part </p> <p> second part </p> <p> third part </p>
<p> fourth part <br/> <br> fifth part";
$healed = substr(
preg_replace('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', "</p><p>", "<p>$article<p>"),
4, -3);
It first puts a <p> at both ends of the string, then replaces every run of paragraph and break tags (together with the whitespace around them) by a single </p><p>, and finally uses substr to strip the leading </p> and the trailing <p>. Note that this does not produce an intermediate array, only the final string.
echo $healed;
outputs:
<p>first part</p><p>second part</p><p>third part</p><p>fourth part</p><p>fifth part</p>
Note that you need the u modifier at the end of the regular expression to get UTF-8 support.
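One consequence of the u modifier that may be worth checking for: if the subject is not valid UTF-8, the preg_* call fails instead of matching. A minimal sketch, assuming a deliberately broken input string ($invalid is made up for this illustration):

```php
<?php
$pattern = '/(\s*<(\/?p|br)\s*\/?>\s*)+/u';
// Hypothetical input: the byte \xFF can never appear in valid UTF-8.
$invalid = "first part <br> \xFF second part";

$result = preg_replace($pattern, "</p><p>", $invalid);
if ($result === null) {
    // preg_replace returns null on error; preg_last_error() says why.
    echo preg_last_error() === PREG_BAD_UTF8_ERROR ? "bad UTF-8\n" : "other error\n";
}
```

If the curl response might be in another encoding, converting it to UTF-8 before running the pattern is probably the safer order.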
If, on the other hand, you need the paragraphs in an array, then preg_split is better suited (using the same regular expression):
// -1 means "no limit"; passing null for this argument is deprecated since PHP 8.1
$paragraphs = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u',
    $article, -1, PREG_SPLIT_NO_EMPTY);
If you then write:
foreach ($paragraphs as $paragraph) {
echo "$paragraph\n";
}
You get:
first part
second part
third part
fourth part
fifth part
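The pattern only consumes whitespace that sits next to a tag, so whitespace at the very start or end of the input stays attached to the first or last piece. A minimal sketch of a wrapper that also trims each piece and rebuilds the markup; the function name splitParagraphs and the sample input are assumptions for illustration, not part of the approach above:

```php
<?php
// Hypothetical helper: split on runs of <p>, </p>, <br> and <br/> tags,
// trim each piece, and drop pieces that end up empty.
function splitParagraphs(string $article): array
{
    $parts = preg_split('/(\s*<(\/?p|br)\s*\/?>\s*)+/u', $article, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        // preg_split returns false on failure, e.g. invalid UTF-8 with the u modifier.
        return [];
    }
    // trim() removes whitespace the pattern cannot reach at the start or end of the input.
    $parts = array_map('trim', $parts);
    // Keep only non-empty pieces and reindex from 0.
    return array_values(array_filter($parts, 'strlen'));
}

$article = "  first part </p> <p> second part <br/> <br> third part  ";
$html = '';
foreach (splitParagraphs($article) as $paragraph) {
    $html .= "<p>$paragraph</p>";
}
echo $html, "\n";
```

With this sample input, the loop should print the three paragraphs wrapped in <p> tags with no stray whitespace inside them.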
Upvotes: 2
Reputation: 131
print_r(preg_split('/((<\s*p\s*\/?>\s*)|(<\s*br\s*\/?>\s*)|(\s\s+)|(<\s*\/p\s*\/?>\s*))+/', $article, -1, PREG_SPLIT_NO_EMPTY));
result:
Array
(
[0] => first part
[1] => second part
[2] => third part
[3] => fourth part
)
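As far as I can tell, the difference from the pattern in the question is the outer group with +, plus the \s* after each tag: a run of tags and the whitespace between them becomes one delimiter, so the single space between </p> and <p> no longer survives as its own piece. A small comparison sketch on a shortened input (the variable names are made up for this illustration):

```php
<?php
$article = "first part </p> <p> second part";

// Pattern from the question: every tag is a delimiter on its own.
$perTag = '/(<\s*p\s*\/?>)|(<\s*br\s*\/?>)|(\s\s+)|(<\s*\/p\s*\/?>)/';
// Pattern from this answer: a whole run of tags and whitespace is one delimiter.
$perRun = '/((<\s*p\s*\/?>\s*)|(<\s*br\s*\/?>\s*)|(\s\s+)|(<\s*\/p\s*\/?>\s*))+/';

$before = preg_split($perTag, $article, -1, PREG_SPLIT_NO_EMPTY);
$after  = preg_split($perRun, $article, -1, PREG_SPLIT_NO_EMPTY);

// The lone ' ' between the tags is not empty, so PREG_SPLIT_NO_EMPTY keeps it.
var_dump($before === ['first part ', ' ', ' second part']);
// Grouped pattern: the space between the tags is swallowed by the delimiter.
var_dump($after === ['first part ', 'second part']);
```

Whitespace before a closing tag is still left on the piece ('first part ' keeps its trailing space), so calling trim() on each element may be worth adding.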
Upvotes: 2