Ray Li

Reputation: 7919

How to use regex to match multiple paragraphs?

I would like to process the html from a webpage and extract the paragraphs that match my criteria. The flavor of regex is PHP.

This is the sample webpage HTML:

<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>

A first regex matches everything between the <div class="special"> and </div> tags and puts it into a capture group for use in the next step. That next step is where I am having trouble: I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.

I have tried /<p>(.+?)<\/p>/s which returns:

<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>

I would like each paragraph to be returned individually as an item in an array. The non-greedy ? does not seem to work. Any suggestions?

Upvotes: 1

Views: 1529

Answers (2)

Ray Li

Reputation: 7919

So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input: the HTML file I was processing had the following structure, which made the regex fail.

<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
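A minimal sketch of why this input defeats the pattern, assuming the markup looked exactly like the sample above and that the matching was done with preg_match_all (the post only says the regex flavor is PHP):

```php
<?php
// Malformed input as in the sample: all closing tags are bunched at the end.
$html = "<p>Some interesting text I would like to extract\n"
      . "<p>More interesting text I would like to extract\n"
      . "<p>Even more interesting text I would like to extract</p></p></p>";

$count = preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

// The lazy .+? stops at the first </p>, but the first </p> only appears
// after the third paragraph, so one match swallows all three paragraphs.
var_dump($count);          // number of matches found
var_dump($matches[1][0]);  // contents of the single capture
```

The non-greedy quantifier is doing its job here. There is simply no earlier </p> for it to stop at.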

I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser rendered the output, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:

include 'filename.php'; 
file_put_contents('filename.php', $data);

Now I know never to trust my browser to show raw data again!
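Another way to inspect the raw markup without writing it to a file is to escape it before printing, so the browser displays the tags instead of interpreting them. This is only a sketch: show_raw is a hypothetical helper name, not something from the original code, and the sample $data is made up for illustration.

```php
<?php
// Hypothetical helper: escape HTML so a browser shows it as literal text.
function show_raw(string $data): string
{
    return '<pre>' . htmlspecialchars($data, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Illustrative input with the same kind of unclosed <p> tags.
$data = "<p>Some text\n<p>More text</p></p>";
echo show_raw($data);
```

Viewing the page source in the browser, or running the script from the command line, should also show the unprocessed output.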

Upvotes: 0

Chin Leung

Reputation: 14921

You have to escape your slash for the p tag.

So it's going to be

/<p>(.+?)<\/p>/s
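For reference, a short sketch of that pattern applied to the well-formed sample from the question. It assumes preg_match_all is used to collect every match; the variable names are illustrative only.

```php
<?php
// Well-formed sample HTML copied from the question.
$html = <<<HTML
<div class="special">
    <p>Some interesting text I would like to extract</p>
    <p>More interesting text I would like to extract</p>
    <p>Even more interesting text I would like to extract</p>
</div>
HTML;

// $matches[0] holds the full matches, $matches[1] the first capture group.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

// One array item per paragraph, without the surrounding <p> tags.
$paragraphs = $matches[1];
print_r($paragraphs);
```

With properly closed tags, each paragraph ends up as its own array item. For arbitrary real-world HTML, a parser such as DOMDocument is usually more robust than a regex.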

Upvotes: 1
