Reputation: 11
OK. Admittedly, I am not the best at working with regular expressions. I am doing a screen scrape, then trying to fix the src values of the embedded img elements so that they point back to the original domain. This is the regex I have been trying variations of (too many to list; here's the current one):
preg_match_all('/<img\b[^>]*>/i', $html, $images);
What this ends up doing is replacing all < with />. What I need it to do is just return the (currently) five images on the page in an array so that I can work with those to fix their src values, then write them back to $html, which is set at the beginning of the file:
$html = file_get_contents($target_url);
Upvotes: 1
Views: 85
Reputation: 237847
Basically, don't do this with regex. You can parse HTML with regex, but it is almost certainly not worth the effort.
Do it with genuine DOM parsing instead, using the DOMDocument class:
$dom = new DOMDocument;
$dom->loadHTML($html);

// Prefix every img src with the original domain.
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    $image->setAttribute('src', 'http://example.com/' . $image->getAttribute('src'));
}

// Serialize the modified document back to a string.
$html = $dom->saveHTML();
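The loop above prefixes every src, including ones that are already absolute. If the scraped page mixes relative and absolute image URLs, a variation along these lines may be closer to what is needed. This is only a sketch: the function name absolutizeImageSources and the sample markup are made up for illustration, it assumes every relative src is relative to the site root, and it silences libxml warnings because scraped HTML is often malformed.

```php
<?php
// Hypothetical helper (not part of the original answer): prefix only
// relative img src values with a base URL.
function absolutizeImageSources(string $html, string $baseUrl): string
{
    $dom = new DOMDocument;

    // Scraped pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');

        // Skip empty values, anything that already has a scheme
        // (http:, https:, data:), and protocol-relative URLs (//host/...).
        if ($src === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//)~i', $src)) {
            continue;
        }

        // Assumption: treat every remaining src as relative to the site root.
        $image->setAttribute('src', rtrim($baseUrl, '/') . '/' . ltrim($src, '/'));
    }

    return $dom->saveHTML();
}

// Sample input, invented for illustration.
$html = '<html><body><img src="/images/a.png"><img src="http://cdn.example.org/b.png"></body></html>';
$result = absolutizeImageSources($html, 'http://example.com');
```

Here the first image would be rewritten to point at http://example.com, while the second, already absolute, is left alone. Paths relative to the page's own directory (for example img/a.png on a page under /blog/) would need proper URL resolution, which this sketch does not attempt.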
Upvotes: 5