Reputation: 3079
I can't seem to figure out the proper regular expression for extracting just specific numbers from a string. I have an HTML string containing a bunch of img tags, and I want to extract a portion of each src value. They follow this format:
<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />
So, varying lengths of numbers before what 'usually' is a .jpg (it may be a .gif, .png, or something else too). I want to only extract the number from that string.
The 2nd part of this is that I want to use that number to look up an entry in a database and grab the alt/title tag for that specific id of image. Lastly, I want to add that returned database value into the string and throw it back into the HTML string.
Any thoughts on how to proceed with it would be great...
Thus far, I've tried:
$pattern = '/img src="http://domain.com/images/[0-9]+\/.jpg';
preg_match_all($pattern, $body, $matches);
var_dump($matches);
Upvotes: 1
Views: 12739
Reputation: 60047
$matches = array();
preg_match_all('/[[:digit:]]+/', $htmlString, $matches);
Then loop through the $matches array both to reconstruct the HTML and to do your lookup in the database.
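A minimal sketch of that loop, under some assumptions: lookupTitle() is a hypothetical stand-in for the database query, and the sample titles are made up. A bare digit pattern matches every run of digits anywhere in the string, so this sketch anchors the match on the /images/ path and captures only the ID.

```php
<?php
// Hypothetical stand-in for the database lookup; replace with a real query.
function lookupTitle(int $id): string
{
    $titles = [59 => 'Sunset', 549 => 'Harbor'];
    return $titles[$id] ?? '';
}

$htmlString = '<img src="http://domain.com/images/59.jpg" class="something" />'
    . '<img src="http://domain.com/images/549.jpg" class="something" />';

$matches = array();
// Group 1 captures the digits between "/images/" and the file extension.
preg_match_all('#/images/([[:digit:]]+)\.#', $htmlString, $matches);

// Map each image ID to the title returned by the lookup.
$titlesById = array();
foreach ($matches[1] as $id) {
    $titlesById[(int) $id] = lookupTitle((int) $id);
}
```

$titlesById can then be used when rebuilding the HTML string.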
Upvotes: 0
Reputation: 198237
Regular expressions alone are on losing ground when it comes to parsing crappy HTML. DOMDocument's HTML handling copes well with tag soup, XPath selects your image srcs, and a simple sscanf extracts the number:
$ids = array();
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach (simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
    // sscanf() returns the number of assigned values; require exactly one
    if (sscanf((string) $src, '%*[^0-9]%d', $number) === 1) {
        $ids[] = $number;
    }
}
Because that only gives you an array, why not encapsulate it?
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$imageNumbers = new ImageNumbers($html);
var_dump((array) $imageNumbers);
Which gives you:
array(4) {
[0]=>
int(59)
[1]=>
int(549)
[2]=>
int(1249)
[3]=>
int(6)
}
This is the code above, nicely wrapped into an ArrayObject:
class ImageNumbers extends ArrayObject
{
    public function __construct($html) {
        parent::__construct($this->extractFromHTML($html));
    }
    private function extractFromHTML($html) {
        $numbers = array();
        $doc = new DOMDocument();
        // Collect libxml parse warnings internally instead of emitting them
        $preserve = libxml_use_internal_errors(TRUE);
        $doc->loadHTML($html);
        foreach (simplexml_import_dom($doc)->xpath('//img/@src[contains(., "/images/")]') as $src) {
            // sscanf() returns the number of assigned values; require exactly one
            if (sscanf((string) $src, '%*[^0-9]%d', $number) === 1) {
                $numbers[] = $number;
            }
        }
        // Restore the previous libxml error handling setting
        libxml_use_internal_errors($preserve);
        return $numbers;
    }
}
If your HTML is so malformed that even DOMDocument::loadHTML() can't handle it, you only need to deal with that internally in the ImageNumbers class.
Upvotes: 0
Reputation: 91518
Use preg_match_all:
preg_match_all('#<img.*?/(\d+)\.#', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => <img src="http://domain.com/images/59.
[1] => <img src="http://domain.com/images/549.
[2] => <img src="http://domain.com/images/1249.
[3] => <img src="http://domain.com/images/6.
)
[1] => Array
(
[0] => 59
[1] => 549
[2] => 1249
[3] => 6
)
)
Upvotes: 1
Reputation: 59709
I think this is the best approach:
Here is an example. There are improvements I can think of, such as using string manipulation instead of a regex.
$html = '<img src="http://domain.com/images/59.jpg" class="something" />
<img src="http://domain.com/images/549.jpg" class="something" />
<img src="http://domain.com/images/1249.jpg" class="something" />
<img src="http://domain.com/images/6.jpg" class="something" />';
$doc = new DOMDocument;
$doc->loadHtml( $html);
foreach( $doc->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    // Skip images whose src does not contain a numeric ID
    if( !preg_match( '#/images/([0-9]+)\.#i', $src, $matches)) {
        continue;
    }
    $id = $matches[1];
    echo 'Fetching info for image ID ' . $id . "\n";
    // Query stuff here
    $result = 'Got this from the DB';
    $img->setAttribute( 'title', $result);
    $img->setAttribute( 'alt', $result);
}
$newHTML = $doc->saveHtml();
Upvotes: 2
Reputation: 324840
Consider using preg_replace_callback.
Use this regex: (images/([0-9]+)[^"]+")
Then, as the callback argument, use an anonymous function. Result:
$output = preg_replace_callback(
"(images/([0-9]+)[^\"]+\")",
function($m) {
// $m[1] is the number.
$t = getTitleFromDatabase($m[1]); // do whatever you have to do to get the title
return $m[0]." title=\"".$t."\"";
},
$input
);
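For reference, here is a self-contained sketch of the same idea that can be run as is. It assumes a stubbed getTitleFromDatabase() backed by a made-up array in place of a real query, and it escapes the title with htmlspecialchars() so that quotes or ampersands in the database value cannot break the attribute.

```php
<?php
// Hypothetical stand-in for the real database query.
function getTitleFromDatabase(int $id): string
{
    $titles = [59 => 'Tom & Jerry', 6 => 'Logo'];
    return $titles[$id] ?? '';
}

$input = '<img src="http://domain.com/images/59.jpg" class="something" />' . "\n"
    . '<img src="http://domain.com/images/6.png" class="something" />';

$output = preg_replace_callback(
    // The outer parentheses are the pattern delimiters, so group 1 is the number.
    '(images/([0-9]+)[^"]+")',
    function ($m) {
        // Escape the title before embedding it in the attribute.
        $t = htmlspecialchars(getTitleFromDatabase((int) $m[1]), ENT_QUOTES);
        return $m[0] . ' title="' . $t . '"';
    },
    $input
);

echo $output, "\n";
```

The title attribute ends up directly after the closing quote of src, in front of the existing class attribute.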
Upvotes: 1
Reputation: 11520
This regex should match the number parts:
\/images\/(?P<digits>[0-9]+)\.[a-z]+
Your $matches['digits'] should have all of the digits you want as an array.
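A short sketch of how that pattern might be used with preg_match_all. The sample HTML and the $ids variable are illustrative only, and "#" is chosen as the delimiter here so the slashes need no escaping.

```php
<?php
$html = '<img src="http://domain.com/images/59.jpg" class="something" />'
    . '<img src="http://domain.com/images/1249.gif" class="something" />';

// The named group "digits" is available under $matches['digits'].
preg_match_all('#/images/(?P<digits>[0-9]+)\.[a-z]+#', $html, $matches);

// The captures are strings; convert them to integers for the lookup.
$ids = array_map('intval', $matches['digits']);
```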
Upvotes: 0
Reputation: 1146
Using regular expressions, you can get the number really easily. The third argument for preg_match_all is a by-reference array that will be populated with the matches that were found.
preg_match_all('/<img src="http:\/\/domain\.com\/images\/(\d+)\.[a-zA-Z]+"/', $html, $matches);
print_r($matches);
$matches[0] will contain the full matches and $matches[1] the captured numbers.
Upvotes: 1