How to use regex to match multiple paragraphs?

Question

I would like to process the HTML from a webpage and extract the paragraphs that match my criteria. The regex flavor is PHP.

This is the sample webpage HTML:


    Some interesting text I would like to extract
    More interesting text I would like to extract
    Even more interesting text I would like to extract

The regex looks between the

and

tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.

I have tried /<p>(.+?)<\/p>/s which returns:

Some interesting text I would like to extract
More interesting text I would like to extract
Even more interesting text I would like to extract

I would like each paragraph to be returned individually as items in an array. The non-greedy ? does not seem to work. Any suggestions?

Ray Li · Accepted Answer

So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input: the HTML file I was processing had the following structure, which made the regex fail.

Some interesting text I would like to extract
More interesting text I would like to extract
Even more interesting text I would like to extract

I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser rendered the output, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:

include 'filename.php'; 
file_put_contents('filename.php', $data);
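
Another way to see the raw markup without leaving the browser is to escape it before printing, so the tags are displayed rather than rendered. This is only a rough sketch: the helper name escape_for_browser and the sample contents of $data are made up for illustration and are not from my actual script.

```php
<?php
// Hypothetical helper: returns the markup with <, >, & and quotes escaped,
// wrapped in <pre> so whitespace and line breaks stay visible.
function escape_for_browser(string $html): string
{
    return '<pre>' . htmlspecialchars($html, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Assumed sample input; in practice $data would be the fetched page.
$data = "<p>Some interesting text I would like to extract</p>";

$escaped = escape_for_browser($data);
echo $escaped;
```

Sending the output with header('Content-Type: text/plain') before any echo should have a similar effect, since the browser then shows the text as-is instead of interpreting it as HTML.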

Now I know to not trust my browser to return raw data ever again!
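
For anyone who wants to check the pattern itself, here is a minimal sketch of it returning each paragraph as a separate array item. It assumes the raw input really does contain a <p>...</p> pair around each paragraph; the $html sample below is reconstructed for illustration, not copied from the real page.

```php
<?php
// Assumed sample input: one <p>...</p> pair per paragraph.
$html = <<<HTML
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
HTML;

// preg_match_all finds every match, not just the first.
// The s modifier lets . match newlines; the lazy +? stops at the nearest </p>.
preg_match_all('/<p>(.+?)<\/p>/s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
$paragraphs = $matches[1];
print_r($paragraphs);
```

With this input, $paragraphs ends up with three elements, one per paragraph. If the array comes back empty or with one big item, the input most likely does not look the way the browser suggests.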
