Jordan

Reputation: 21760

I could use some Regex help

EDIT: What I'm looking for below is a REGEX statement that says something like this:

I'll store these in an array, then I'll fetch the pages. For each page, I'll then need to grab the image url, so I'll need the regex code for that. I know it's brittle, but it'll get the job done for what I need.

I have a page of html, with groups of the following:

<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472">

I'd like to retrieve all the urls listed under the div class='productBundle'. There could be any number per page, but always under the productBundle div.

Then, from each of those HTML pages, I need to get the product image URL:

<img id='productImage' src='http://listing.com/item/472248/472.jpg'>

For example, I need "http://listing.com/item/472248/472.jpg" from the html code above.

I could use the help with the REGEX code to grab the pages in the first part, then the REGEX code to grab the url from the productImage.

Thanks

Upvotes: 0

Views: 75

Answers (5)

Jordan

Reputation: 21760

This does the trick.

"http:\/\/listing\.com\/product[^""]*html"

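As a rough illustration of how a pattern like the one above behaves, here is a minimal Python sketch run against the markup from the question. The names LISTING_HTML, PRODUCT_URL and find_product_urls are made up for this example, and the pattern is a Python rendering of the same idea rather than a character-for-character copy.

```python
import re

# Sample markup copied from the question.
LISTING_HTML = """<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472">
"""

# Start at the product URL prefix, then take everything that is not a
# double quote, ending at "html".
PRODUCT_URL = re.compile(r'http://listing\.com/product[^"]*html')


def find_product_urls(html):
    """Return every product page URL found in the markup, in document order."""
    return PRODUCT_URL.findall(html)


print(find_product_urls(LISTING_HTML))
```

Note that this matches any such URL anywhere on the page; it does not check that the link sits inside a productBundle div.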
Upvotes: 0

eater

Reputation: 2827

Consider: RegEx match open tags except XHTML self-contained tags


Edit to add useful content: That said, this is very brittle, but should work...

Perl for grabbing the .html URLs:

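$/ = undef; # slurp mode: read the whole input as one string
$in = <>;   # read the file named on the command line (or STDIN)
# Each successful s/// deletes the text it matched, so the loop moves on to the next bundle.
while ($in =~ s/<div class='productBundle'.*?<a href="(.*?\.html)//sm) {
  print "$1\n";
}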

Perl for grabbing the .jpg URLs:

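$/ = undef; # slurp mode: read the whole input as one string
$in = <>;   # read the file named on the command line (or STDIN)
# Each successful s/// deletes the text it matched, so the loop moves on to the next image.
while ($in =~ s/<img id='productImage'.*?src='(.*?\.jpg)//sm) {
  print "$1\n";
}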

The .*? means match zero or more characters non-greedily, so it matches only up to the first occurrence of whatever follows it. The /s modifier tells Perl that . should also match newlines (which it doesn't by default). The /m modifier makes ^ and $ match at line boundaries; neither anchor appears in these patterns, so /m has no effect here.
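To make the greedy versus non-greedy difference concrete, here is a small Python sketch of the same image pattern. The PAGE markup is invented for the example (the second, thumbnail image is hypothetical), and re.DOTALL stands in for Perl's /s.

```python
import re

# Hypothetical product page: the productImage tag is split across two
# lines and followed by a second, unrelated image.
PAGE = """<html><body>
<img id='productImage'
     src='http://listing.com/item/472248/472.jpg'>
<img id='thumb' src='http://listing.com/item/472248/thumb.jpg'>
</body></html>"""

# re.DOTALL plays the role of Perl's /s: "." may cross the line break
# between the id and src attributes.
LAZY = re.compile(r"<img id='productImage'.*?src='(.*?jpg)", re.DOTALL)
GREEDY = re.compile(r"<img id='productImage'.*src='(.*jpg)", re.DOTALL)

lazy_url = LAZY.search(PAGE).group(1)      # stops at the first src and first "jpg"
greedy_url = GREEDY.search(PAGE).group(1)  # runs on to the last src on the page

print(lazy_url)
print(greedy_url)
```

The lazy version picks up the productImage URL; the greedy version overshoots and returns the thumbnail URL, which is why the non-greedy form matters here.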

Upvotes: 1

orangepips

Reputation: 9971

Use an HTML parser that produces an XML representation, plus XPath.

  1. Choose an HTML parser for your particular language that produces an XML representation.
  2. Load the HTML with the product listing and find the HREFs using this XPath statement: //div[@class='productBundle']//a/@href.
  3. Iterate over the results and HTTP GET each href value.
  4. For each response, use the parser and XPath again to find the image paths: //img/@src.
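The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it swaps the XML-plus-XPath approach for Python's standard-library html.parser (an event-based parser, not XPath), it narrows step 4 to the img with id productImage from the question, and the names BundleLinkParser, ProductImageParser, image_urls, LISTING and FAKE_PAGES are all invented here. Fetching is passed in as a function so the sketch runs without network access.

```python
from html.parser import HTMLParser


class BundleLinkParser(HTMLParser):
    """Collect href values of <a> tags inside a div with class productBundle."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a productBundle div (counts nested divs)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth or attrs.get("class") == "productBundle":
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


class ProductImageParser(HTMLParser):
    """Collect src values of <img> tags whose id is productImage."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == "productImage" and attrs.get("src"):
            self.srcs.append(attrs["src"])


def image_urls(listing_html, fetch):
    """Find product links, fetch each page, and return the image URLs.

    `fetch` is a caller-supplied function mapping a URL to its HTML text.
    """
    links = BundleLinkParser()
    links.feed(listing_html)
    found = []
    for href in links.hrefs:
        images = ProductImageParser()
        images.feed(fetch(href))
        found.extend(images.srcs)
    return found


# Hypothetical listing (closing tags added) and a stand-in for the network.
LISTING = """<div class='productBundle' id='4086472'>
<table cellpadding="0" cellspacing="0" class='inv'>
<tr><td valign="middle" align="center" width="100%">
<a href="http://listing.com/product/view/4086794.html" alt="472"></a>
</td></tr></table></div>
<a href="http://listing.com/about.html">About</a>"""

FAKE_PAGES = {
    "http://listing.com/product/view/4086794.html":
        "<img id='productImage' src='http://listing.com/item/472248/472.jpg'>",
}

print(image_urls(LISTING, FAKE_PAGES.__getitem__))
```

In real use, `fetch` would be replaced by something that performs the HTTP GET; the link outside the productBundle div is skipped because the depth counter has dropped back to zero by then.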

Upvotes: 0

paxdiablo

Reputation: 882146

No, what you need help with is processing a markup language, and using regular expressions for that is like using a screwdriver to hammer in a nail.

In other words, you can get it to work, but it takes a fair bit of effort to catch all the edge cases.

My suggestion is to use an XML processing tool, the selection of which depends on the language and environment you're using.

Upvotes: 3

cdhowie

Reputation: 169221

You should really use XPath for this instead. Load the document into whatever container your framework provides that supports XPath, and issue this query:

//div[@class='productBundle']//img/@src

The result will be the list of strings you need.
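As one possible way to try such a query, here is a minimal Python sketch using the standard-library xml.etree.ElementTree. Several things are assumptions: the XHTML fragment is a hypothetical, well-formed version of the listing with an image nested inside the bundle, and ElementTree supports only a subset of XPath (no /@src step), so the attribute is read in a second step. The class value is written as productBundle to match the question's markup, since XPath string comparison is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the listing markup. ElementTree is
# an XML parser and would reject real-world tag soup.
XHTML = """<html><body>
<div class="productBundle" id="4086472">
  <table><tr><td>
    <a href="http://listing.com/product/view/4086794.html">
      <img src="http://listing.com/item/472248/472.jpg" />
    </a>
  </td></tr></table>
</div>
<div class="banner"><img src="http://listing.com/banner.jpg" /></div>
</body></html>"""

root = ET.fromstring(XHTML)

# ElementTree's XPath subset cannot select attributes directly, so select
# the <img> elements and read their src attribute afterwards.
images = root.findall(".//div[@class='productBundle']//img")
srcs = [img.get("src") for img in images]

print(srcs)
```

The banner image is left out because its parent div has a different class. A full XPath engine would accept the /@src form directly.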

Upvotes: 1
