Reputation: 21760
EDIT: What I'm looking for below is a REGEX statement that says something like this:
I'll store these in an array, then I'll fetch the pages. For each page, I'll then need to grab the image url, so I'll need the regex code for that. I know it's brittle, but it'll get the job done for what I need.
I have a page of html, with groups of the following:
<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472">
I'd like to retrieve all the urls listed under the div class='productBundle'. There could be any number per page, but always under the productBundle div.
Then, from each of those HTML pages, I need to get the product image URL:
<img id='productImage' src='http://listing.com/item/472248/472.jpg'>
For example, I need "http://listing.com/item/472248/472.jpg" from the html code above.
I could use help with the regex to grab the page URLs in the first part, and then the regex to grab the URL from the productImage tag.
Thanks
Upvotes: 0
Views: 75
Reputation: 2827
Consider: RegEx match open tags except XHTML self-contained tags
Edit to add useful content: That said, this is very brittle, but should work...
Perl for grabbing the .html URLs:
$/ = undef; # slurp mode: read the whole input as one string
$in = <>; # read file provided on command line
while ($in =~ s/<div class='productBundle'.*?<a href=\"(.*?html)//sm) {
print "$1\n";
}
Perl for grabbing the .jpg URLs:
$/ = undef; # slurp mode: read the whole input as one string
$in = <>; # read file provided on command line
while ($in =~ s/<img id='productImage'.*?src='(.*?jpg)//sm) {
print "$1\n";
}
The .*? means match zero or more characters non-greedily, so it matches only up to the first occurrence of whatever follows it. The /s modifier tells Perl that . should also match newlines (which it doesn't by default). The /m modifier makes ^ and $ match at line boundaries; since these patterns use neither anchor, it has no effect here.
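For readers not using Perl, here is a rough Python equivalent of the two patterns above, as a sketch only. The sample strings are the snippets from the question (with the quote on the img id repaired), and `re.DOTALL` stands in for Perl's /s. The variable names are made up for illustration.

```python
import re

# Sample markup copied from the question (the img id quote is repaired here).
LISTING_HTML = """
<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472">
"""

PRODUCT_HTML = "<img id='productImage' src='http://listing.com/item/472248/472.jpg'>"

# re.DOTALL plays the role of Perl's /s: "." may cross line breaks.
# ".*?" is lazy, so each match stops at the first href after the div.
PAGE_LINK = re.compile(
    r"<div class='productBundle'.*?<a href=\"(.*?\.html)\"", re.DOTALL
)
IMAGE_SRC = re.compile(r"<img id='productImage'.*?src='(.*?\.jpg)'", re.DOTALL)

# With a single capture group, findall returns the captured strings.
page_urls = PAGE_LINK.findall(LISTING_HTML)
image_urls = IMAGE_SRC.findall(PRODUCT_HTML)

print(page_urls)
print(image_urls)
```

Like the Perl version, this is brittle: it depends on the exact quoting and attribute order shown in the question.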
Upvotes: 1
Reputation: 9971
Use an HTML parser that produces an XML representation, then query it with XPath:
//div[@class='productBundle']//a/@href
//img/@src
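As a minimal sketch of that approach, the snippet below uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath: it can select elements but not `/@href` directly, so the attribute is read from each matched element instead. The input is a hypothetical, hand-closed version of the question's markup; the closing tags and the extra footer link are assumptions added for illustration, and real-world HTML would first need a tolerant parser to become well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-closed version of the question's markup.
# The footer div is invented to show that unrelated links are skipped.
LISTING_XML = """
<html><body>
<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472">item</a>
</td></tr>
</table>
</div>
<div class='footer'><a href="http://listing.com/about.html">about</a></div>
</body></html>
"""

PRODUCT_XML = (
    "<html><body>"
    "<img id='productImage' src='http://listing.com/item/472248/472.jpg'/>"
    "</body></html>"
)

# Step 1: every <a> anywhere below a div with class productBundle.
listing = ET.fromstring(LISTING_XML)
links = listing.findall(".//div[@class='productBundle']//a")
page_urls = [a.get("href") for a in links]

# Step 2: on a product page, the img whose id is productImage.
product = ET.fromstring(PRODUCT_XML)
images = product.findall(".//img[@id='productImage']")
image_urls = [img.get("src") for img in images]

print(page_urls)
print(image_urls)
```

Filtering the image by its id, rather than taking every `//img/@src`, avoids picking up unrelated images on the product page.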
Upvotes: 0
Reputation: 882146
No, what you need help with is processing a markup language, and using regular expressions for that is like using a screwdriver to hammer in a nail.
In other words, you can get it to work, but it takes a fair bit of effort to catch all the edge cases.
My suggestion is to use an XML processing tool, the selection of which depends on the language and environment you're using.
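As one possible illustration, assuming Python is the environment, the standard-library `html.parser` module tolerates unclosed tags and mixed quoting. The class name `BundleLinkParser`, the sample's closing tags, and the extra "help" link are all hypothetical; this is a sketch of the idea rather than a recommended tool.

```python
from html.parser import HTMLParser


class BundleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside div.productBundle."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # greater than 0 while inside a productBundle div
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at a productBundle div; count nested divs too.
            if self.div_depth or attrs.get("class") == "productBundle":
                self.div_depth += 1
        elif tag == "a" and self.div_depth and "href" in attrs:
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


# Markup from the question, closed by hand; the second div is invented
# to show that links outside a productBundle are ignored.
SAMPLE = """
<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472">item</a>
</td></tr></table>
</div>
<div class='other'><a href="http://listing.com/help.html">help</a></div>
"""

parser = BundleLinkParser()
parser.feed(SAMPLE)
parser.close()
bundle_urls = parser.urls
print(bundle_urls)
```

Tracking the div depth is what ties each link to its enclosing productBundle, which is the part a regex has trouble expressing reliably.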
Upvotes: 3
Reputation: 169221
You should really use XPath for this instead. Load the document into whatever container your framework provides that supports XPath, and issue this query:
//div[@class='productBundle']//img/@src
The result will be the list of strings you need.
Upvotes: 1