Reputation: 1578
I have an HTML file which contains the following:
<img src="MATCH1" bla="blabla">
<something:else bla="blabla" bla="bla"><something:else2 something="something">
<something image="MATCH2" bla="abc">
Now I need a regex that matches both MATCH1 and MATCH2.
The HTML contains several blocks like this, so the pattern can occur 1, 2, 3, or any number of times.
When I try:
<img\s*src="(.*?)".*?<something\s*image="(.*?)"
It doesn't match it. What am I missing here?
Thanks in advance!
Upvotes: 3
Views: 476
Reputation: 42093
Regex does not always give reliable results when parsing HTML.
I think you should use an HTML DOM parser instead.
For example:
// Create a DOM object from a URL
$html = file_get_html('http://www.example.com/');
// OR create a DOM object from a local HTML file
$html = file_get_html('test.htm');
// Find all images and print their src attributes
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
// Find all links and print their href attributes
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
There are filters to get tags with specific attributes:
[attribute] Matches elements that have the specified attribute.
[attribute=value] Matches elements that have the specified attribute with a certain value.
[attribute!=value] Matches elements that don't have the specified attribute with a certain value.
[attribute^=value] Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value] Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value] Matches elements that have the specified attribute and it contains a certain value.
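The filters above belong to the parser library used in the example. As a rough illustration of the same idea without any third-party code, here is a minimal sketch that swaps in PHP's bundled DOMDocument and DOMXPath instead. The sample markup is modeled on the question, the tag name "something" is only a placeholder, and the XPath expressions are my approximate counterparts of the [attribute] and [attribute^=value] filters, not part of the library.

```php
<?php
// Sample markup modeled on the question; "something" is a placeholder tag.
$html = '<img src="MATCH1" bla="blabla">' . "\n"
      . '<something image="MATCH2" bla="abc">';

// Unknown tags such as <something> trigger libxml warnings, so collect them quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Roughly [attribute]: any element that carries an "image" attribute.
$images = [];
foreach ($xpath->query('//*[@image]') as $node) {
    $images[] = $node->getAttribute('image');
}

// Roughly [attribute^=value]: <img> elements whose src starts with "MATCH".
$sources = [];
foreach ($xpath->query('//img[starts-with(@src, "MATCH")]') as $node) {
    $sources[] = $node->getAttribute('src');
}

print_r($images);
print_r($sources);
```

XPath 1.0 has contains() for the [attribute*=value] case but no built-in ends-with(), so the [attribute$=value] filter has no direct one-call equivalent here.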
There are also other tools for parsing HTML, as described in this answer.
Upvotes: 10
Reputation: 145482
Hmm, I'd better elaborate before more anti-regex memers come around. In your case it is actually reasonable to use regular expressions. However, I'd like to point out that you should carefully weigh the pros and cons.
It's usually simpler to use phpQuery or QueryPath for such tasks:
qp($html)->find("img")->attr("src");
But a regex is possible too, if you don't overlook the gritty details:
preg_match('#<img[^>]+src="([^">]*)".+?<something\s[^>]*image="([^">]*)"#ims', $html, $m);
If the extraction depends on both tags being present, then a regex might be the better option here.
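Since the question mentions that the block repeats, here is a minimal sketch that applies the same pattern with preg_match_all to collect every pair. The sample input and the values A1, A2, B1, B2 are made up for illustration. The s modifier lets "." match newlines, which is probably what the original attempt was missing, because the two tags sit on different lines.

```php
<?php
// Hypothetical input modeled on the question, with the block repeated twice.
$html = <<<HTML
<img src="A1" bla="blabla">
<something:else bla="blabla" bla="bla"><something:else2 something="something">
<something image="A2" bla="abc">
<img src="B1" bla="blabla">
<something image="B2" bla="abc">
HTML;

// i: case-insensitive; s: "." also matches newlines;
// m has no effect here because the pattern contains no ^ or $ anchors.
$pattern = '#<img[^>]+src="([^">]*)".+?<something\s[^>]*image="([^">]*)"#ims';

$pairs = [];
// PREG_SET_ORDER groups the captures per match instead of per capture group.
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $pairs[] = [$m[1], $m[2]]; // [src value, image value]
    }
}

print_r($pairs);
```

Note that `<something\s` requires whitespace right after the tag name, so `<something:else ...>` is skipped and only the plain `<something image="...">` tag closes each pair.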
Upvotes: 2