Reputation: 7919
I would like to process the HTML from a webpage and extract the paragraphs that match my criteria. The regex flavor is PHP (PCRE).
This is the sample webpage HTML:
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
The regex looks between the <div class="special"> and </div> tags and puts everything into a capture group or variable for reference in the next step. This next step is what I am having trouble with. I cannot for the life of me write a regex that captures each paragraph of text between <p> and </p>.
I have tried /<p>(.+?)<\/p>/s which returns:
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
I would like each paragraph to be returned individually as items in an array. The non-greedy ? does not seem to work. Any suggestions?
Upvotes: 1
Views: 1529
Reputation: 7919
So stupid! The regex works perfectly. All the regexes work perfectly. The problem was with the input: the HTML file I was processing had the following structure, which kept the regex from matching each paragraph separately.
<p>Some interesting text I would like to extract
<p>More interesting text I would like to extract
<p>Even more interesting text I would like to extract</p></p></p>
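To see why that structure defeats the pattern, here is a minimal sketch that runs the same regex against input shaped like the sample above. The variable names ($broken, $count, $m) are only illustrative.

```php
<?php
// Input shaped like the broken file: the closing tags all pile up at the end.
$broken = "<p>Some interesting text I would like to extract\n"
        . "<p>More interesting text I would like to extract\n"
        . "<p>Even more interesting text I would like to extract</p></p></p>";

// Same pattern as in the question.
$count = preg_match_all('/<p>(.+?)<\/p>/s', $broken, $m);

// The lazy .+? still has to run to the first </p>, and the first </p>
// only appears after all three paragraphs, so there is a single match.
echo "matches: ", $count, "\n";

// The captured text still contains the two inner <p> openers.
echo "inner <p> tags: ", substr_count($m[1][0], '<p>'), "\n";
```

Non-greedy matching stops at the earliest closing tag, but it cannot stop before one exists, which is why everything came back as one result.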
I used var_dump(htmlfile.html) to see the HTML page I was getting, but my browser rendered the output, so I was not seeing the raw data. I was able to get the raw data and find my mistake by using:
include 'filename.php';
file_put_contents('filename.php', $data);
Now I know never to trust my browser to show me raw data again!
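Another way to look at raw markup without writing a file is to escape it before printing, so the browser shows the tags instead of interpreting them. This is only a sketch: $data here is a hypothetical stand-in for whatever HTML string the script actually fetched.

```php
<?php
// Hypothetical sample standing in for the fetched HTML.
$data = "<p>Some interesting text I would like to extract\n<p>More interesting text";

// Convert <, >, & and quotes to entities so the browser displays them literally.
$escaped = htmlspecialchars($data);

// <pre> keeps the original line breaks visible.
echo '<pre>', $escaped, '</pre>';
```

Using the browser's "view source" feature, or sending a Content-Type: text/plain header before any output, should give a similar unrendered view.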
Upvotes: 0
Reputation: 14921
You have to escape the slash in the closing </p> tag, because / is also the pattern delimiter.
So it's going to be
/<p>(.+?)<\/p>/s
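For reference, a minimal sketch of the two-step extraction described in the question, using preg_match for the div and preg_match_all for the paragraphs. It assumes well-formed input like the question's sample and no nested <div> inside the block; the variable names are illustrative.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div class="special">
<p>Some interesting text I would like to extract</p>
<p>More interesting text I would like to extract</p>
<p>Even more interesting text I would like to extract</p>
</div>
HTML;

$paragraphs = [];

// Step 1: capture everything inside the div (assumes no nested <div>).
if (preg_match('/<div class="special">(.*?)<\/div>/s', $html, $div)) {
    // Step 2: collect every <p>...</p> body. With the default flags,
    // $m[1] is an array holding capture group 1 of each match.
    preg_match_all('/<p>(.+?)<\/p>/s', $div[1], $m);
    $paragraphs = $m[1];
}

print_r($paragraphs);
```

preg_match_all is what returns each paragraph as a separate array item; preg_match alone stops after the first match.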
Upvotes: 1