Reputation: 2862
With PHP, how can I isolate the contents of the src attribute from $foo? The end result I'm looking for would give me just "http://example.com/img/image.jpg"
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';
Upvotes: 42
Views: 69498
Reputation: 1334
<?php
$html = '
<img border="0" src="/images/image1.jpg" alt="Image" width="100" height="100" />
<img border="0" src="/images/image2.jpg" alt="Image" width="100" height="100" />
<img border="0" src="/images/image3.jpg" alt="Image" width="100" height="100" />
';
$get_Img_Src = '/<img[^>]*src=([\'"])(?<src>.+?)\1[^>]*>/i'; // captures only the img src path
preg_match_all($get_Img_Src, $html, $result);
if (!empty($result['src'])) { // $result itself is never empty, so check the named group
echo $result['src'][0];
echo $result['src'][1];
}
To capture the alt text along with the src path, use the regex below instead of the one above (it expects src to appear before alt):
<img[^>]*src=(['"])(?<src>.+?)\1[^>]*alt=(['"])(?<alt>.+?)\3[^>]*>
$get_Img_Src = '/<img[^>]*src=([\'"])(?<src>.+?)\1[^>]*alt=([\'"])(?<alt>.+?)\3[^>]*>/i'; // captures src path and alt text; \3 is the quote that opened alt (named groups are numbered too)
preg_match_all($get_Img_Src, $html, $result);
if (!empty($result['src'])) {
echo $result['src'][0];
echo $result['src'][1];
echo $result['alt'][0];
echo $result['alt'][1];
}
I got the idea for this solution from the question "PHP extract link from a href tag".
To extract URLs from a specific domain only, try the regex below.
// e.g. if you need to extract only URLs on "test.com",
// you can use a regex like this:
<a[^>]+href=([\'"])(?<href>(https?:\/\/)?test\.com.*?)\1[^>]*>
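As a rough sketch of how a domain-restricted pattern like this could be used with preg_match_all: the sample markup and the variable names below are made up for illustration, and the lazy quantifier is written as `.*?`. Note that `test\.com.*?` would also accept a host such as test.com.example.org, so treat this as a starting point rather than a strict domain check.

```php
<?php
// Hypothetical sample markup; test.com is the example domain used above.
$html = '<a href="http://test.com/page1">one</a> '
      . '<a href="https://other.org/x">two</a> '
      . "<a class='nav' href='test.com/page2'>three</a>";

// Group 1 remembers the opening quote; \1 requires the same quote to close the value.
$pattern = '~<a[^>]+href=([\'"])(?<href>(?:https?://)?test\.com.*?)\1[^>]*>~i';

preg_match_all($pattern, $html, $matches);

// The named group holds one entry per matching link.
$links = $matches['href'];
print_r($links);
```

Using `~` as the delimiter avoids having to escape the slashes in `https?://`.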
To get img src attributes that contain base64-encoded data, you can do it as shown below. You can test it on onlinephp.io.
<?php
$html = '
<p>test </p>
<img border="0" src="/images/image1.jpg" alt="Image" width="100" height="100" />
<img border="0" src="/images/image2.jpg" alt="Image" width="100" height="100" />
<img border="0" src="/images/image3.jpg" alt="Image" width="100" height="100" />
<img border="0" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAJUAAAAfCAYAAADuiY/xAAAAGXRF..." alt="Base64 Image 1" width="100" height="100" />
<img border="0" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAAAAAAAAAAAAAAEAAQAAAQAAAQEAAEAAAD/..." alt="Base64 Image 2" width="100" height="100" />
<h1>asas</h1>
<img border="0" src="/images/image2.jpg" alt="Image" width="100" height="100" />
<img border="0" src="http://test.com/images/image2.jpg" width="100" height="100" />
<img src="data:image/gif;base64,R0lGODlhPQBEAP8A..." />
';
$get_Img_Src = '/<img[^>]*src=["\'](data:image\/[^;]+;base64[^"\']+)["\'][^>]*>/i'; // Regex to capture base64 image src
preg_match_all($get_Img_Src, $html, $result);
// Debugging step: print the entire result array
//echo "Full result:\n";
//print_r($result);
$updated_html = $html; // start from the original so the final echo works even when nothing matches
if (!empty($result[1])) {
echo "Base64 matches found: " . count($result[1]) . PHP_EOL;
// Access the base64 data in the first capture group, i.e. $result[1]
foreach ($result[1] as $index => $base64) {
echo $base64 . PHP_EOL; // Echo each base64 encoded image string
// Replace the base64 string with the corresponding URL
$updated_html = str_replace($base64, 'http://demo.com/replaced-url-' . ($index + 1), $updated_html);
}
} else {
echo "No base64 images found." . PHP_EOL;
}
echo PHP_EOL . PHP_EOL . PHP_EOL . PHP_EOL . PHP_EOL . PHP_EOL . "see updated html: url instead of base64";
echo PHP_EOL . $updated_html;
?>
Upvotes: 1
Reputation: 423
I use preg_match_all to capture all images in an HTML document:
preg_match_all("~<img[^>]*src\s*=\s*[\"']([^\"']+)[\"'][^>]*>~i", $body, $matches);
This one allows a more relaxed declaration syntax, with spaces and different quote types. It uses [^>]* rather than .* before src so that a match cannot run across several tags on the same line.
The regex reads as: <img (any attributes such as style or border) src (optional spaces) = (optional spaces) (' or ") (one or more non-quote characters) (' or ") (anything up to >) (>)
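A small sketch of that pattern in action, assuming a made-up `$body` with two images on one line that use different quote styles and spacing. It uses `[^>]*` before `src` in place of `.*`, because a greedy `.*` would swallow everything up to the last `src` on the line and report only one image.

```php
<?php
// Hypothetical input: mixed quote styles and spaces around "=".
$body = '<p><img class="a" src = "/img/one.png" alt="1">'
      . "<img src='/img/two.png' width='10'></p>";

// [^>]* cannot cross a ">", so each match stays inside a single tag.
preg_match_all('~<img[^>]*src\s*=\s*["\']([^"\']+)["\'][^>]*>~i', $body, $matches);

// Group 1 holds the attribute values, one per matched tag.
$srcs = $matches[1];
print_r($srcs);
```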
Upvotes: 0
Reputation: 379
Let's assume I use
$text = '<img src="blabla.jpg" alt="blabla" />';
in
getTextBetween('src="','"',$text);
with a getTextBetween() that passes the end position to substr() as if it were a length. The code will return:
blabla.jpg" alt="blabla"
which is wrong: we want the code to return only the text between the attribute's quotes, i.e. the value in attr="value".
So
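function getTextBetween($start, $end, $text)
{
    // keep everything after the first occurrence of the start string
    $parts = explode($start, $text, 2);
    $first_strip = end($parts); // end() needs a variable, not a function result
    // keep everything before the first occurrence of the end string
    $final_strip = explode($end, $first_strip)[0];
    return $final_strip;
}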
does the trick!
Try
getTextBetween('src="','"',$text);
It will return:
blabla.jpg
Thanks all the same, because your solution gave me an insight into the final solution.
Upvotes: -1
Reputation: 8149
I'm extremely late to this, but I have a simple solution not yet mentioned. Load it with simplexml_load_string (if you have simplexml enabled) and then flip it through json_encode and json_decode.
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';
$parsedFoo = json_decode(json_encode(simplexml_load_string($foo)), true);
var_dump($parsedFoo['@attributes']['src']); // output: "http://example.com/img/image.jpg"
$parsedFoo comes through as
array(1) {
["@attributes"]=>
array(6) {
["class"]=>
string(12) "foo bar test"
["title"]=>
string(10) "test image"
["src"]=>
string(32) "http://example.com/img/image.jpg"
["alt"]=>
string(10) "test image"
["width"]=>
string(3) "100"
["height"]=>
string(3) "100"
}
}
I've been using this for parsing XML and HTML for a few months now and it works pretty well. I've had no hiccups yet, though I haven't had to parse a large file with it (I imagine using json_encode and json_decode like that will get slower the larger the input gets). It's convoluted, but it's by far the easiest way to read HTML properties.
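For what it's worth, the JSON round trip is not strictly needed to read a single attribute: a SimpleXMLElement exposes attributes through array-style access. The sketch below is a variation on the approach above, not the code this answer uses, and it shares the same limitation that the input has to be well-formed XML.

```php
<?php
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';

// simplexml_load_string() returns false when the input is not well-formed XML.
$xml = simplexml_load_string($foo);

// Attributes are read with array syntax; the cast turns the attribute object into a plain string.
$src = ($xml !== false) ? (string) $xml['src'] : null;
var_dump($src);
```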
Upvotes: 4
Reputation: 8496
I got this code:
$dom = new DOMDocument();
$dom->loadHTML($img);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');
Assuming there is only one img :P
Upvotes: 9
Reputation: 8017
You can work around this problem using this function:
function getTextBetween($start, $end, $text) {
    $start_from = strpos($text, $start);
    $start_pos = $start_from + strlen($start);
    $end_pos = strpos($text, $end, $start_pos);
    // substr() takes a length as its third argument, not an end position
    $subtext = substr($text, $start_pos, $end_pos - $start_pos);
    return $subtext;
}
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';
$img_src = getTextBetween('src="', '"', $foo);
Upvotes: 1
Reputation: 2862
Here's what I ended up doing, although I'm not sure about how efficient this is:
$imgsplit = explode('"',$data);
foreach ($imgsplit as $item) {
if (strpos($item, 'http') !== FALSE) {
$image = $item;
break;
}
}
Upvotes: 2
Reputation: 54445
If you don't wish to use regex (or any non-standard PHP components), a reasonable solution using the built-in DOMDocument class would be as follows:
<?php
$doc = new DOMDocument();
$doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
$imageTags = $doc->getElementsByTagName('img');
foreach($imageTags as $tag) {
echo $tag->getAttribute('src');
}
?>
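If the markup is less tidy than this example, loadHTML() tends to emit warnings, and some img tags may have no src at all. Here is a minimal sketch that wraps the same DOMDocument calls in a helper; the function name extractImageSources is made up for illustration.

```php
<?php
// Hypothetical helper built on the DOMDocument approach shown above.
function extractImageSources(string $html): array
{
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $tag) {
        // Skip images that have no src attribute.
        if ($tag->hasAttribute('src')) {
            $sources[] = $tag->getAttribute('src');
        }
    }
    return $sources;
}

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';
print_r(extractImageSources($foo));
```

Returning an array keeps the helper usable whether the input holds one image or several.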
Upvotes: 78
Reputation: 342635
// Requires the Simple HTML DOM library linked below (e.g. include 'simple_html_dom.php';)
// Create DOM from string
$html = str_get_html('<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />');
// echo the src attribute
echo $html->find('img', 0)->src;
http://simplehtmldom.sourceforge.net/
Upvotes: 7
Reputation: 5417
<?php
$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg" alt="test image" width="100" height="100" />';
$array = array();
preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
print_r( $array[1] ) ;
Output: http://example.com/img/image.jpg
Upvotes: 40
Reputation: 115
Try this pattern (the x modifier is needed so that the spaces inside the pattern are ignored):
'/< \s* img [^>]* src \s* = \s* ["\']? ( [^"\'\s>]* )/xi'
Upvotes: 0