Reputation: 5741
Because redbubble.com has no API, I'm using an Atom feed to scrape information about a user's pictures.
This is what the XML looks like:
<entry>
<id>ID</id>
<published>Date Published</published>
<updated>Date Updated</updated>
<link type="text/html" rel="alternate" href="http://www.redbubble.com/link/to/post"/>
<title>Title</title>
<content type="html">
Blah blah blah stuff about the image..
<a href="http://www.redbubble.com/products/configure/config-id"><img src="http://ih1.redbubble.net/path-to-image" alt="" />
</content>
<author>
<name>Author Name</name>
<uri>http://www.redbubble.com/people/author-user-name</uri>
</author>
<link type="image/jpeg" rel="enclosure" href="http://ih0.redbubble.net/path-to-the-original-image"/>
<category term="1"/>
<category term="2"/>
</entry>
Using a regex, how would I go about getting the href attribute of the link inside the content tag?
The one thing we know for sure is that the URL will always have configure in its path, i.e. http://somesite.com/**configure**/id
So I just need to find the URL that contains configure and grab the whole thing.
Upvotes: 1
Views: 2626
Reputation: 5741
Thanks for your awesome answers, but my colleague solved it for me!
This is what I ended up using:
/http:\/\/([^"\/]*\/)*configure\/([^"]*)/
(Ruby regex by the way)
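A minimal Ruby sketch of how that pattern could be applied, assuming the decoded HTML of the content tag is already in a string. The names `SAMPLE_CONTENT`, `CONFIGURE_URL` and `configure_url` are made up for illustration:

```ruby
# Sample HTML taken from the <content> tag in the question.
SAMPLE_CONTENT = 'Blah blah blah stuff about the image.. ' \
  '<a href="http://www.redbubble.com/products/configure/config-id">' \
  '<img src="http://ih1.redbubble.net/path-to-image" alt="" />'

# The pattern from this answer: any number of path segments, then "configure/",
# then everything up to the closing quote.
CONFIGURE_URL = /http:\/\/([^"\/]*\/)*configure\/([^"]*)/

# Returns the whole configure URL, or nil when the text has none.
def configure_url(html)
  match = CONFIGURE_URL.match(html)
  match && match[0]
end

puts configure_url(SAMPLE_CONTENT)
```

If only the id is needed, `match[2]` holds the part after `configure/` (here `config-id`).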
Upvotes: 1
Reputation: 93006
If you have to use a regex, try this one:
href="(?=[^"]*configure)([^"]*)
The lookahead checks that the attribute value contains configure before the group captures it.
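As a rough illustration in Ruby (the language the asker mentions), the pattern could be run over the raw entry text like this. `SAMPLE_ENTRY`, `HREF_WITH_CONFIGURE` and `configure_hrefs` are hypothetical names, and the sample is trimmed from the question:

```ruby
# A few lines from the entry in the question, each carrying an href.
SAMPLE_ENTRY = <<~XML
  <link type="text/html" rel="alternate" href="http://www.redbubble.com/link/to/post"/>
  <a href="http://www.redbubble.com/products/configure/config-id"><img src="http://ih1.redbubble.net/path-to-image" alt="" />
  <link type="image/jpeg" rel="enclosure" href="http://ih0.redbubble.net/path-to-the-original-image"/>
XML

# Lookahead pattern from this answer; group 1 captures the attribute value.
HREF_WITH_CONFIGURE = /href="(?=[^"]*configure)([^"]*)/

# Returns every href value in the text that contains "configure".
def configure_hrefs(text)
  text.scan(HREF_WITH_CONFIGURE).flatten
end

p configure_hrefs(SAMPLE_ENTRY)
```

Note that the lookahead accepts "configure" anywhere in the value, so a URL containing, say, "preconfigure" would also be captured.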
Upvotes: 1
Reputation: 359906
Whatever programming language you're using, don't try to parse the whole thing with a regex. Use an XML parser first to extract the href="...". Then, sure, use a regex to make sure the URL contains configure.
As @KARASZI commented, XPath is another good approach.
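A sketch of the parser-first idea in Ruby, using REXML (which ships with Ruby, as a bundled gem in recent versions). It assumes the feed entity-escapes the HTML inside the content tag, as Atom's type="html" calls for, so the sample below is an escaped variant of the one in the question. The parser isolates the content text, and only then does a small regex pick out the URL. `ENTRY_XML` and `configure_link` are hypothetical names:

```ruby
require 'rexml/document'

# Assumed shape of a real entry: the HTML inside <content> is entity-escaped.
ENTRY_XML = <<~XML
  <entry>
    <title>Title</title>
    <content type="html">
      Blah blah blah stuff about the image..
      &lt;a href="http://www.redbubble.com/products/configure/config-id"&gt;&lt;img src="http://ih1.redbubble.net/path-to-image" alt="" /&gt;&lt;/a&gt;
    </content>
    <link type="image/jpeg" rel="enclosure" href="http://ih0.redbubble.net/path-to-the-original-image"/>
  </entry>
XML

# Parses the entry, reads the decoded text of <content>, and returns the
# first href in it whose path has a "configure" segment (nil if none).
def configure_link(entry_xml)
  doc = REXML::Document.new(entry_xml)
  html = doc.root.elements['content'].text # entities are decoded here
  html[/href="([^"]*\/configure\/[^"]*)"/, 1]
end

puts configure_link(ENTRY_XML)
```

Because the regex only ever sees the content text, the enclosure link and other attributes in the entry cannot produce false matches.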
Upvotes: 1
Reputation: 2674
The following regex will extract the href content based on your requirements. It seems to work for the sample code.
href="(\w[^"]+/configure/\w[^"]+)
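For reference, a small Ruby sketch of this pattern in use. `%r{...}` is used so the slashes need no escaping, and `HREF_CONFIGURE_SEGMENT` and `href_with_configure_segment` are illustrative names only:

```ruby
# Pattern from this answer; group 1 captures the URL inside href="...".
HREF_CONFIGURE_SEGMENT = %r{href="(\w[^"]+/configure/\w[^"]+)}

# Returns the first href value with a /configure/ path segment, or nil.
def href_with_configure_segment(text)
  text[HREF_CONFIGURE_SEGMENT, 1]
end

sample = '<a href="http://www.redbubble.com/products/configure/config-id">'
puts href_with_configure_segment(sample)
```

Two details of the pattern worth knowing: it needs `/configure/` as a complete path segment, and the trailing `\w[^"]+` requires at least two characters after it, so a one-character id would not match.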
Upvotes: 2