user1139872

In PHP, how can I use a regular expression to capture everything between two patterns (taking the shortest match for each occurrence)?

I must be overcomplicating this, but I can't figure it out for the life of me.

I have a standard HTML document stored as a string, and I need to get the contents of each paragraph. Here is an example case.

$stringHTML=
"<html>

<head>
<title>Title</title>
</head>

<body>

<p>This is the first paragraph</p>
<p>This is the second</p>
<p>This is the third</p>
<p>And fourth</p>

</body>
</html>";

If I use

$regex='~(<p>)(.*)(</p>)~i';
preg_match_all($regex, $stringHTML, $newVariable); 

I won't get 4 results. Rather, I'll get 10, because the regex matches the first <p> paired with the first </p> as well as the first <p> paired with the fourth </p>.

How can I search between two words and return only what's between each pair of paragraph tags?

Upvotes: 0

Views: 92

Answers (3)

Shiplu Mokaddim

Reputation: 57690

Use an HTML parser such as DOM or XPath to parse HTML. Don't use regex to parse HTML. Here is how it can easily be parsed with DOMDocument.

$doc = new \DOMDocument;
$doc->loadHTML($stringHTML);
$ps = $doc->getElementsByTagName("p");
for($i=0;$i<$ps->length; $i++){
    echo $ps->item($i)->textContent. "\n";
}
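
The answer mentions XPath but only shows getElementsByTagName. As a minimal sketch of the XPath route, assuming PHP's built-in DOMXPath class and a shortened sample document (the sample string and the $paragraphs variable are illustrative, not from the original post):

```php
<?php
// Illustrative input: a shortened version of the question's document.
$stringHTML = "<html><body>\n"
            . "<p>This is the first paragraph</p>\n"
            . "<p>This is the second</p>\n"
            . "</body></html>";

$doc = new DOMDocument();
$doc->loadHTML($stringHTML);

// Query every <p> element anywhere in the document.
$xpath = new DOMXPath($doc);
$paragraphs = [];
foreach ($xpath->query('//p') as $node) {
    // textContent returns the text inside the element, without the tags.
    $paragraphs[] = $node->textContent;
}

print_r($paragraphs);
```

The XPath expression can be narrowed later (for example to paragraphs inside a particular element) without touching the loop, which is the main reason to prefer it over a tag-name lookup.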


If this is regex practice, as you said, this pattern will give you 4 results.

preg_match_all("#<p>(.*)</p>#", $stringHTML, $matches);
print_r($matches[1]);

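This pattern uses a capture group rather than lookaround syntax, so the paragraph text ends up in $matches[1]. It gives 4 results here because each paragraph is on its own line and . does not match a newline by default.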

Upvotes: 1

revo

Reputation: 48751

Your regex should be /<p>(.*?)<\/p>/i. It matches only the text between <p> and </p> and puts each match in an array.

You don't need a capture group around the tag itself, i.e. (<p>).
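
To make the point about groups concrete, here is a small sketch comparing the array that preg_match_all fills in for the single-group pattern against the three-group pattern from the question (made lazy here so the two are comparable). The $count and $withTags names are illustrative.

```php
<?php
// The four paragraphs from the question, one per line.
$stringHTML = "<p>This is the first paragraph</p>\n"
            . "<p>This is the second</p>\n"
            . "<p>This is the third</p>\n"
            . "<p>And fourth</p>";

// One capture group: $matches[0] holds the full matches,
// $matches[1] holds only the text between the tags.
$count = preg_match_all('/<p>(.*?)<\/p>/i', $stringHTML, $matches);

// Three capture groups, as in the question: the tags get their own sub-arrays.
preg_match_all('~(<p>)(.*?)(</p>)~i', $stringHTML, $withTags);

echo $count, "\n";           // number of full-pattern matches
print_r($matches[1]);        // paragraph contents only
echo count($withTags), "\n"; // full match plus one sub-array per group
```

Here $count is 4, and $withTags has 4 sub-arrays (the full match plus three groups) of 4 entries each, so a count of the nested result array will not equal the number of paragraphs. Dropping the groups around the tags leaves just $matches[0] and $matches[1].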

Upvotes: 0

Barmar

Reputation: 782295

Use .*? to get the shortest match instead of the longest match.
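
A quick sketch of the difference, assuming a hypothetical input where all paragraphs sit on one line (the case where the greedy form visibly overruns):

```php
<?php
// Hypothetical input: every paragraph on the same line.
$html = '<p>one</p><p>two</p><p>three</p>';

// Greedy: .* runs to the last </p> on the line, giving a single long match.
preg_match_all('~<p>(.*)</p>~i', $html, $greedy);

// Lazy: .*? stops at the first </p> after each <p>.
preg_match_all('~<p>(.*?)</p>~i', $html, $lazy);

print_r($greedy[1]);
print_r($lazy[1]);
```

The greedy pattern captures one string, one</p><p>two</p><p>three, while the lazy pattern captures one, two and three separately.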

Upvotes: 0
