Reputation:
I must be overcomplicating this, but I can't figure it out for the life of me.
I have a standard HTML document stored as a string, and I need to get the contents of each paragraph. Here is an example case.
$stringHTML=
"<html>
<head>
<title>Title</title>
</head>
<body>
<p>This is the first paragraph</p>
<p>This is the second</p>
<p>This is the third</p>
<p>And fourth</p>
</body>
</html>";
If I use
$regex='~(<p>)(.*)(</p>)~i';
preg_match_all($regex, $stringHTML, $newVariable);
I won't get 4 results. Rather, I'll get 10. I get 10 because the regex matches the first <p> and first </p> as well as the first <p> and fourth </p>.
How can I search between two words and return only what's between each pair of paragraph tags?
Upvotes: 0
Views: 92
Reputation: 57690
Use an HTML parser such as DOM or XPath to parse HTML. Don't use regex to parse HTML. Here is how it can be parsed easily with DOMDocument.
$doc = new \DOMDocument;
$doc->loadHTML($stringHTML);
$ps = $doc->getElementsByTagName("p");
for($i=0;$i<$ps->length; $i++){
echo $ps->item($i)->textContent. "\n";
}
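Since XPath is mentioned above, here is a minimal sketch of the same extraction using DOMXPath. The shortened input string and the $paragraphs variable are assumptions made up for illustration, not part of the original question.

```php
<?php
// Sample input, shortened from the question (assumed for illustration).
$stringHTML = "<html><body>
<p>This is the first paragraph</p>
<p>This is the second</p>
</body></html>";

$doc = new DOMDocument();
$doc->loadHTML($stringHTML);

// DOMXPath::query() returns a DOMNodeList of every <p> directly under <body>.
$xpath = new DOMXPath($doc);
$paragraphs = [];
foreach ($xpath->query('//body/p') as $p) {
    // textContent is the text inside the element, with the tags stripped.
    $paragraphs[] = $p->textContent;
}

print_r($paragraphs);
```

An XPath expression is handy once the selection gets more specific than "every p", for example only paragraphs inside a particular element.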
Using this regex (as you said it's regex practice) you'll get 4 results.
preg_match_all("#<p>(.*)</p>#", $stringHTML, $matches);
print_r($matches[1]);
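Here a capture group is used, so the paragraph text ends up in $matches[1]. It returns 4 results for this input because . does not match newlines by default and each paragraph sits on its own line.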
Upvotes: 1
Reputation: 48751
Your regex should be /<p>(.*?)<\/p>/i. It will match only the text between <p> and </p> and put it in an array.
You don't need to put the tags themselves in a group, as in (<p>).
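To illustrate the difference the ? makes, here is a small sketch comparing the greedy and lazy quantifiers. The $html string is a made-up input with two paragraphs on one line, and the s flag is an extra assumption added so that . can also cross newlines.

```php
<?php
// Hypothetical input: two paragraphs on one line, so greedy and lazy differ.
$html = "<p>first</p><p>second</p>";

// Greedy: .* runs to the last </p> on the line, giving one long match.
preg_match_all('~<p>(.*)</p>~i', $html, $greedy);

// Lazy: .*? stops at the nearest </p>, giving one match per paragraph.
// The s flag lets . match newlines, for paragraphs that span several lines.
preg_match_all('~<p>(.*?)</p>~is', $html, $lazy);

print_r($greedy[1]); // one element: "first</p><p>second"
print_r($lazy[1]);   // two elements: "first" and "second"
```

Using ~ as the delimiter, as in the question, avoids having to escape the slash in </p>.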
Upvotes: 0