Reputation: 193
I've run into a problem when using regular expressions:
php> $html = "<html><head><body><h1>hello world</h1><img src=\"data:rawIMGdata\" /><p/><img src=\"sdfsdf.jpg\" title=\"pic1\" /><p/><div class=\"myclass\"><img src=\"data:imageData\" /></div><img alt=\"bla\" src=\"bla.jpg\" title=\"bla\" /></body></html>";
php> $pat = '/<img.*src="(data:.*)"/m';
php> preg_match_all($pat, $html, $matching);
php> var_dump($matching);
array(2) {
[0]=>
array(1) {
[0]=>
string(169) "<img src="data:rawIMGdata" /><p/><img src="sdfsdf.jpg" title="pic1" /><p/><div class="myclass"><img src="data:imageData" /></div><img alt="bla" src="bla.jpg" title="bla""
}
[1]=>
array(1) {
[0]=>
string(63) "data:imageData" /></div><img alt="bla" src="bla.jpg" title="bla"
}
}
My expected output would be just "data:imageData" in the second array, and moreover there should be two matches (the other one being "data:rawIMGdata").
Did I define my regex the wrong way?
Regards, Broncko
Upvotes: 0
Views: 79
Reputation: 21007
If you are trying to parse valid (or almost valid) HTML, you may want to use a tool built for parsing XML, such as DOM, which lets you browse through the document quite effectively.
RegExp will definitely do the job, but once you swap ' for " or the HTML changes from <img src=""> to <img class="" src=""> you may run into issues.
XML parsing utilities also usually take care of escaping and unescaping attribute values, and they handle duplicate attributes.
For example, use DOMXPath (here's a [tutorial]):
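$doc = new DOMDocument;
// Parse the HTML string from the question (use load() for an XML file instead)
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = '//img';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    // Skip images that have no src attribute
    if (!$entry->hasAttribute('src')) {
        continue;
    }
    $src = $entry->getAttribute('src');
    // Skip anything that is not a data: URI
    if (strncmp($src, 'data:', 5) != 0) {
        continue;
    }
    // Strip the "data:" prefix
    $content = substr($src, 5);
    // Do whatever you need
}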
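As a self-contained variant of the loop above, here is a minimal sketch that runs against markup like the one in the question. The helper name extractDataUris is hypothetical, and moving the "data:" filter into the XPath query with starts-with() is my assumption about how you might shorten it, not part of the original answer.

```php
<?php
/**
 * Collect the src values of all <img> elements whose src starts with "data:".
 * Hypothetical helper, for illustration only.
 */
function extractDataUris(string $html): array
{
    // Real-world HTML is rarely valid XML, so collect libxml parse
    // warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $uris = [];
    // starts-with() is XPath 1.0, so the filtering happens in the query itself.
    foreach ($xpath->query('//img[starts-with(@src, "data:")]') as $img) {
        $uris[] = $img->getAttribute('src');
    }
    return $uris;
}

// Shortened version of the question's markup; the last <img> uses single
// quotes to show that the quote style does not matter to the parser.
$html = '<html><body><h1>hello world</h1><img src="data:rawIMGdata" />'
      . '<img src="sdfsdf.jpg" title="pic1" />'
      . '<div class="myclass"><img src="data:imageData" /></div>'
      . "<img alt='bla' src='bla.jpg' title='bla' /></body></html>";

print_r(extractDataUris($html));
```

Because the parser handles attribute order and quoting, the same call keeps working if the markup later changes in the ways described above.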
Upvotes: 1
Reputation: 1239
You might want to consider using DOMDocument for parsing HTML, although if this example is as complex as it is going to get, then you can probably get away with a regex. DOMDocument will always be more robust, though.
Try this:
/<img.*?src="(data:[^"]*)"/m
The ? makes the * non-greedy, so it takes the minimum match (by default it grabs as much as it can).
And rather than matching anything, you can match anything that isn't a " with [^"]*.
The .* in your original pattern was greedy and matched all the way up to a " in another element.
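To see the difference, here is a small sketch that runs both patterns against a shortened copy of the question's markup. The variable names are mine, chosen for illustration.

```php
<?php
// Shortened version of the markup from the question, on a single line.
$html = '<img src="data:rawIMGdata" /><p/><img src="sdfsdf.jpg" title="pic1" />'
      . '<div class="myclass"><img src="data:imageData" /></div>'
      . '<img alt="bla" src="bla.jpg" title="bla" />';

// Original pattern: both .* quantifiers are greedy.
$greedy = '/<img.*src="(data:.*)"/m';
// Suggested pattern: lazy .*? before src, and [^"]* stops at the closing quote.
$lazy = '/<img.*?src="(data:[^"]*)"/m';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($lazy, $html, $lazyMatches);

// One oversized capture that runs on to the last " in the string.
print_r($greedyMatches[1]);
// Two captures: data:rawIMGdata and data:imageData.
print_r($lazyMatches[1]);
```

Note that .*? can still cross element boundaries while it searches for the next src="data:, so the full match in index 0 may span several tags even though the captured group is right. Replacing .*? with [^>]*? would keep each match inside a single <img> tag.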
Upvotes: 1
Reputation: 22705
You're basically telling PCRE to grab too much. Quantifiers are greedy by default and match as much as possible, which is why you're getting so much extra stuff in your matches. First, switch to the non-greedy variants for matching the initial whitespace and/or the contents of the element. Second, introduce a proper delimiter to match the end of the attribute's contents. Here's the pattern you ought to be using:
$pat = '/<img.*?src="(data:[^"]*)"/m';
Upvotes: 1
Reputation: 171
Try using a 'lazy' expression (both quantifiers need the ?, otherwise the second group still runs on to the last " on the line):
$pat = '/<img(.*?)src="(data:.*?)"/m';
More information: http://www.regular-expressions.info/repeat.html
Upvotes: 0